Run results show how your functions performed against test cases and evaluation criteria. Result views are available in the AIP Evals application or the integrated AIP Evals sidebar in AIP Logic and AIP Agent Studio.
If you have configured pass criteria on your evaluators, AIP Evals will automatically determine a Passed or Failed status for each test case. The results page displays the overall pass percentage across all test cases.
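To make the pass-percentage calculation concrete, here is a minimal sketch of the arithmetic AIP Evals performs. The function name and the list-of-status representation are hypothetical, chosen only for illustration; they are not part of the AIP Evals API.

```python
def pass_percentage(statuses):
    """Return the share of test cases with a "Passed" status, as a percentage.

    `statuses` is a hypothetical list of per-test-case results,
    each either "Passed" or "Failed".
    """
    if not statuses:
        return 0.0
    passed = sum(1 for s in statuses if s == "Passed")
    return 100.0 * passed / len(statuses)

# Three of four test cases passed, so the results page would show 75%.
print(pass_percentage(["Passed", "Failed", "Passed", "Passed"]))  # 75.0
```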
When you want to investigate a specific test case result further, use the debug view. It provides execution traces, input/output data, and error messages for individual test cases so you can understand your function outputs and evaluator results.
You can open the debug view for a test case from AIP Evals, AIP Logic, or AIP Agent Studio.
The debug view provides detailed information about test function execution and evaluator results.
Evaluation functions that are backed by AIP Logic, such as the out-of-the-box Rubric grader or Contains key details evaluators, allow access to the native Logic debugger. This helps you understand why the evaluation produced a specific result, which is particularly valuable when using an LLM-as-a-judge evaluator.
In the example shown in the screenshot below, the rubric grader evaluator did not pass because its score of 8 did not meet the defined minimum threshold of 9. Looking into the Logic debugger, we can see that the LLM judge awarded only 8 points because the response was wrapped in quotation marks. To earn a higher score, we will need to improve our prompt.
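The threshold check behind this result can be sketched as follows. This is an illustrative model only, not the AIP Evals implementation; the function name and signature are hypothetical.

```python
def rubric_passes(score, minimum_threshold):
    """Hypothetical sketch of a rubric-grader pass criterion:
    the test case passes only if the judged score meets the
    configured minimum threshold."""
    return score >= minimum_threshold

# The case from the example: the LLM judge awarded 8 points,
# but the configured minimum threshold is 9, so the case fails.
print(rubric_passes(8, 9))  # False
print(rubric_passes(9, 9))  # True
```

Raising the score therefore requires changing the function's behavior (here, improving the prompt so the response is not wrapped in quotation marks), not adjusting the evaluator.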