
Update Tool Call Accuracy to output unified format#46319

Draft
m7md7sien wants to merge 13 commits into main from mohessie/unify_output/tool_call_accuracy

Conversation

Contributor

@m7md7sien m7md7sien commented Apr 14, 2026

Description

Update Tool Call Accuracy to output unified format

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions Bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Apr 14, 2026
```diff
 # Check for intermediate response
 if _is_intermediate_response(eval_input.get("response")):
-    return self._not_applicable_result(
+    return self._return_not_applicable_result(
```
Member
I am not very sure this is n/a; it feels like the caller did something wrong in preparing the data. To me it is more like invalid input rather than a not-applicable case, where a not-applicable case is one in which the inputs are perfectly normal but the evaluator cannot score for some reason.

Contributor Author
This case can happen if the user provided a normal response id or normal data, but the response is not complete yet because it is waiting for input from the user to approve a tool call. I am not sure we should return invalid input when given a response id.

```diff
-    f"gpt_{self._result_key}": score,
+    f"{self._result_key}_score": score,
+    f"{self._result_key}_result": score_result,
+    f"{self._result_key}_passed": score_result == "pass",
```
Member

we shouldn't need to compute pass here. Elaine's PR will handle pass/fail based on result


@aprilk-ms some evaluators already return a `_passed` bool. The SDK currently does multiple checks on the evaluator response, like `_result == "passed"`, `_label == "passed"`, and `_passed == True`; we want to standardize so we only have to check `_passed`.

The evaluator itself is already calculating pass/fail, because the threshold setting (e.g. score > threshold -> pass vs. score < threshold -> pass) is stored in the evaluator class, so it makes sense to just propagate from there.

Contributor Author

I updated the dependencies in azure-ai-evaluation on `_result` to use `_passed` in #46436. However, I think we should keep both for now until we deploy the service changes, to avoid breaking changes.
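A consumer-side sketch of the migration-window behavior described above: prefer the unified `_passed` bool, but fall back to the legacy string fields while both are emitted. The helper name is hypothetical; the field names and compared values come from the discussion:

```python
def is_passed(result, key):
    """Hypothetical back-compat check: prefer the unified `_passed` bool,
    falling back to the legacy `_result`/`_label` string fields."""
    if f"{key}_passed" in result:
        return result[f"{key}_passed"] is True
    for legacy in (f"{key}_result", f"{key}_label"):
        if legacy in result:
            # Older evaluators used either "pass" or "passed" as the value.
            return result[legacy] in ("pass", "passed")
    return False
```

Keeping both fields during the rollout means old consumers checking `_result` keep working while new consumers switch to the single `_passed` check.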

Copilot AI and others added 7 commits April 16, 2026 22:10
…ed properties handling (#46355)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…outputs (#46449)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
