Update Tool Call Accuracy to output unified format #46319
Conversation
…matting (#46336)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
```diff
 # Check for intermediate response
 if _is_intermediate_response(eval_input.get("response")):
-    return self._not_applicable_result(
+    return self._return_not_applicable_result(
```
I am not sure this is n/a; it feels like the caller did something wrong in preparing the data. To me this is more like invalid input than a not-applicable case, which would be one where the inputs are perfectly normal but the evaluator cannot score them for some reason.
This case can happen if the user provided a normal response id or normal data, but the response is not complete yet because it is waiting for input from the user to approve a tool call. I am not sure we should return invalid input when given a response id.
```python
f"gpt_{self._result_key}": score,
f"{self._result_key}_score": score,
f"{self._result_key}_result": score_result,
f"{self._result_key}_passed": score_result == "pass",
```
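For context, the unified result shape under discussion might be assembled roughly like this. The key names follow the snippet above, but `build_result`, the threshold value, and the comparison direction are illustrative assumptions, not the actual SDK implementation.

```python
def build_result(result_key: str, score: float, threshold: float) -> dict:
    """Sketch: assemble an evaluator result in the unified key format."""
    # Comparison direction is an assumption; some metrics may invert it
    score_result = "pass" if score >= threshold else "fail"
    return {
        f"gpt_{result_key}": score,                      # legacy key
        f"{result_key}_score": score,                    # unified numeric score
        f"{result_key}_result": score_result,            # "pass" / "fail" string
        f"{result_key}_passed": score_result == "pass",  # boolean flag
    }

result = build_result("tool_call_accuracy", score=4.0, threshold=3.0)
# result["tool_call_accuracy_passed"] is True
```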
We shouldn't need to compute pass here; Elaine's PR will handle pass/fail based on the result.
@aprilk-ms some evaluators already return a _passed bool. The SDK currently does multiple checks on the evaluator response, like _result == "passed", _label == "passed", and _passed == True, so we want to standardize on checking only _passed.
The evaluator itself is already calculating pass/fail because the threshold setting (e.g. score > threshold -> pass vs. score < threshold -> pass) is stored in the evaluator class, so it makes sense to just propagate it from there.
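A minimal sketch of that idea, assuming a `higher_is_better` flag and the class and method names below (none of which come from the actual SDK):

```python
class ScoreEvaluator:
    """Sketch: the evaluator owns its threshold, so it can compute
    pass/fail itself and propagate the boolean in its result."""

    def __init__(self, threshold: float, higher_is_better: bool = True):
        self._threshold = threshold
        self._higher_is_better = higher_is_better  # direction of comparison

    def is_pass(self, score: float) -> bool:
        # Score above the threshold passes when higher is better,
        # below it when lower is better (e.g. for error-rate metrics)
        if self._higher_is_better:
            return score >= self._threshold
        return score <= self._threshold
```

Keeping the comparison inside the evaluator means callers never need to know the threshold or its direction; they only read the propagated boolean.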
I updated the dependencies on _result in azure-ai-evaluation to use _passed in #46436. However, I think we should keep both for now until we deploy the service changes, to avoid breaking changes.
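The transitional check on the SDK side could be sketched like this. `evaluator_passed` is a hypothetical helper: it prefers the unified `_passed` flag and falls back to the legacy string-valued keys mentioned above.

```python
def evaluator_passed(row: dict, result_key: str) -> bool:
    """Hypothetical helper: resolve pass/fail across unified and legacy keys."""
    # Prefer the unified boolean flag when present
    passed = row.get(f"{result_key}_passed")
    if isinstance(passed, bool):
        return passed
    # Fall back to the legacy string-valued keys
    for key in (f"{result_key}_result", f"{result_key}_label"):
        if row.get(key) in ("pass", "passed"):
            return True
    return False
```

Once the service changes are deployed and only `_passed` is emitted, the fallback branch can be deleted without changing caller behavior.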
…ed properties handling (#46355)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

…outputs (#46449)
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Description
Update Tool Call Accuracy to output unified format