LLMs are powerful but can be unpredictable due to their non-deterministic nature. Before you put an LLM-backed workflow into production or change an existing implementation, you need confidence that it behaves as expected.
AIP Evals helps you build that confidence by providing the means to evaluate your LLM-based functions and prompts. You can use AIP Evals to:
Create test cases and define evaluation criteria.
Debug, iterate, and improve functions and prompts.
Compare the performance of different models on your functions.
Examine variance across multiple runs.
Core concepts
Evaluation suite: The collection of test cases and evaluation functions used to benchmark function performance.
Evaluation function: The method used to compare the actual output of a function against the expected output(s).
Test cases: Defined sets of inputs and expected outputs that are passed into evaluation functions during evaluation suite runs.
Metrics: The results of evaluation functions. Metrics are produced per test case and can be compared in aggregate or individually between runs.
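The following is a minimal, illustrative sketch of how these concepts fit together; it is plain Python, not the AIP Evals API, and all names in it (TestCase, exact_match, run_suite, summarize_topic) are hypothetical. It shows an evaluation suite running a function under test on each test case, applying an evaluation function to the actual and expected outputs, and recording one metric per test case.

```python
from dataclasses import dataclass
from typing import Callable

# Test case: a defined input and the output we expect (hypothetical structure).
@dataclass
class TestCase:
    input: str
    expected_output: str

# Evaluation function: compares the actual output against the expected output
# and returns a metric (here, a simple pass/fail score).
def exact_match(actual: str, expected: str) -> dict:
    return {
        "metric": "exact_match",
        "score": 1.0 if actual.strip() == expected.strip() else 0.0,
    }

# Evaluation suite: runs the function under test on every test case,
# applies each evaluation function, and collects one metric per test case.
def run_suite(
    function_under_test: Callable[[str], str],
    test_cases: list[TestCase],
    evaluation_functions: list[Callable[[str, str], dict]],
) -> list[dict]:
    results = []
    for case in test_cases:
        actual = function_under_test(case.input)  # e.g., an LLM-backed function or prompt
        for evaluate in evaluation_functions:
            metric = evaluate(actual, case.expected_output)
            results.append({"input": case.input, **metric})
    return results

# Hypothetical usage: benchmark an LLM-backed summarization function.
if __name__ == "__main__":
    def summarize_topic(topic: str) -> str:
        # Placeholder for the LLM-backed function being evaluated.
        return f"Summary of {topic}"

    cases = [TestCase(input="photosynthesis", expected_output="Summary of photosynthesis")]
    metrics = run_suite(summarize_topic, cases, [exact_match])
    print(metrics)  # per-test-case metrics that can be compared between runs
```

Because metrics are produced per test case, the results of separate runs can be compared individually or in aggregate, which is what enables comparing models and examining variance across runs.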