Run results show how your functions performed against test cases and evaluation criteria. Result views are available in the AIP Evals application or the integrated AIP Evals sidebar in AIP Logic and AIP Agent Studio.
If you have configured pass criteria on your evaluators, AIP Evals will automatically determine a Passed or Failed status for each test case. The results page displays the overall pass percentage across all test cases.
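As a rough illustration of the logic behind a numeric pass criterion and the overall pass percentage, the sketch below uses hypothetical names (`EvaluatorResult`, `passed`, `passRate`); it is not the AIP Evals API, only the underlying comparison and aggregation.

```typescript
// Hypothetical types -- not the AIP Evals API, just the logic behind a
// numeric pass criterion and the overall pass percentage.
interface EvaluatorResult {
  testCaseId: string;
  score: number; // metric produced by the evaluator
}

// A pass criterion such as "score must be at least 9" maps each
// evaluator result to a Passed or Failed status.
function passed(result: EvaluatorResult, minScore: number): boolean {
  return result.score >= minScore;
}

// The results page aggregates these statuses into one pass percentage.
function passRate(results: EvaluatorResult[], minScore: number): number {
  const passCount = results.filter((r) => passed(r, minScore)).length;
  return results.length === 0 ? 0 : (100 * passCount) / results.length;
}

// Example: a score of 8 against a minimum threshold of 9 fails,
// so one of the two test cases passes (50%).
const results: EvaluatorResult[] = [
  { testCaseId: "greeting", score: 10 },
  { testCaseId: "quoted-answer", score: 8 },
];
console.log(passRate(results, 9)); // 50
```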
When you want to investigate a specific test case result further, open the debug view. It provides execution traces, input/output data, and error messages for individual test cases so you can understand your function outputs and evaluator results.
You can open the debug view for a test case from AIP Evals, AIP Logic, or AIP Agent Studio.


The debug view provides detailed information about test function execution and evaluator results. It allows you to:
- Trace the execution of your function for a single test case.
- Inspect the inputs and outputs of each test case.
- Review error messages when a run fails.
- See how each evaluator scored the result.


Custom function evaluators can return string values alongside their metric outputs. These strings appear as Debug outputs in the evaluator tab, providing additional context such as reasoning, intermediate values, or diagnostic information.
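As a hedged sketch of what this can look like, the evaluator below uses hypothetical names (`CustomEvaluatorResult`, `evaluateAnswer`, `debugOutput`) and a generic return shape; the exact signature and return type depend on how your custom function evaluators are authored.

```typescript
// Hypothetical shape of a custom evaluator result: a numeric metric plus a
// debug string surfaced as "Debug outputs" in the evaluator tab.
interface CustomEvaluatorResult {
  metrics: { exactMatch: number };
  debugOutput: string;
}

// Hypothetical custom evaluator: compares the function output to the
// expected answer and records why the comparison passed or failed.
function evaluateAnswer(actual: string, expected: string): CustomEvaluatorResult {
  const normalized = actual.trim().replace(/^"|"$/g, ""); // strip wrapping quotes
  const matches = normalized === expected;
  return {
    metrics: { exactMatch: matches ? 1 : 0 },
    debugOutput: matches
      ? "Exact match after trimming whitespace and surrounding quotes."
      : `Mismatch: expected "${expected}" but got "${normalized}".`,
  };
}
```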

Evaluation functions backed by AIP Logic, such as the out-of-the-box Rubric grader and Contains key details evaluators, provide access to the native Logic debugger. This helps you understand why the evaluation produced a specific result, which is particularly useful when using an LLM-as-a-judge evaluator.
In the example shown in the screenshot below, the Rubric grader evaluator did not pass because its score of 8 did not meet the defined minimum threshold of 9. In the Logic debugger, we can see that the LLM judge awarded only 8 points because the response was wrapped in quotation marks. To earn a higher score, we need to improve our prompt.

The AIP Evals results analyzer uses large language models to help you quickly understand why test cases failed and how to fix them. It automatically clusters failures into root cause categories and, when appropriate, proposes targeted prompt changes.
Use the results analyzer to:
- Understand why test cases in a run failed.
- Group failures into root cause categories.
- Get suggested prompt improvements for AIP Logic functions.
You can use the results analyzer from either AIP Evals or AIP Logic.
From AIP Evals:

From AIP Logic:

You can optionally configure:
The analyzer displays a root cause analysis card with an overview and category tabs.
Use the Filter table option to focus the results table on a category or on a single example test case, allowing you to drill down further with the debug view or evaluator details.
For AIP Logic functions, each category can include a Suggested prompt improvement that you can copy with one click. Suggestions call out:
Review suggestions and apply changes in your Logic function's prompts where appropriate, then re-run your evaluation suites to validate improvements.