Evaluation
Definition
Systematically measuring whether an AI workflow produces good outputs. Evaluation can take the form of human review, automated checks, or a second model grading the first model's outputs. Without evaluation, you cannot tell whether a change to a prompt, model, or workflow made things better or worse.
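As a rough illustration of the "automated checks" option, the sketch below scores two versions of a workflow against the same test cases and reports a pass rate for each. Everything in it is an assumption made for illustration: the `passes_check` rule, the `pass_rate` helper, the stand-in `workflow_v1` and `workflow_v2` functions (a real workflow would call a model), and the company names.

```python
from typing import Callable

def passes_check(draft: str, company: str, max_words: int = 120) -> bool:
    """Hypothetical automated check: the draft must name the lead's
    company and stay within a word limit."""
    return company.lower() in draft.lower() and len(draft.split()) <= max_words

def pass_rate(workflow: Callable[[str], str], companies: list[str]) -> float:
    """Run the workflow on every test case and return the fraction that pass."""
    results = [passes_check(workflow(company), company) for company in companies]
    return sum(results) / len(results)

# Stand-in workflows (assumptions); a real one would call a model.
def workflow_v1(company: str) -> str:
    return "Hello, we would love to show you our product."

def workflow_v2(company: str) -> str:
    return f"Hello {company} team, we would love to show you our product."

# Made-up test cases; the same set is used for both versions.
companies = ["Acme", "Globex", "Initech"]
v1_rate = pass_rate(workflow_v1, companies)
v2_rate = pass_rate(workflow_v2, companies)
print(f"v1 pass rate: {v1_rate:.0%}, v2 pass rate: {v2_rate:.0%}")
```

Running both versions on the same fixed test set is what makes the two numbers comparable; the check itself can be swapped for human ratings or a grading model without changing the overall shape.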
Example
Before swapping the model behind your sales-email generator, you run 50 historical leads through both the current model and the new one, and have a reviewer rate the drafts. The new model wins on 38 of 50 (76%), so you can ship the swap with evidence.
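The tally in this example can be sketched in a few lines. The sketch assumes every one of the 50 comparisons has a clear winner (no ties), and the helper names `win_rate` and `sign_test_p_value` are hypothetical. The second helper is an optional extra that the example does not mention: a simple one-sided sign test that estimates how likely a result at least this lopsided would be if the two models were equally good.

```python
from math import comb

def win_rate(wins: int, total: int) -> float:
    """Fraction of head-to-head comparisons won by the new model."""
    return wins / total

def sign_test_p_value(wins: int, total: int) -> float:
    """One-sided sign test: probability of seeing at least `wins` wins
    out of `total` if each comparison were a fair coin flip.
    Assumes no ties among the comparisons."""
    favourable = sum(comb(total, k) for k in range(wins, total + 1))
    return favourable / 2 ** total

rate = win_rate(38, 50)
p_value = sign_test_p_value(38, 50)
print(f"win rate: {rate:.0%}")
print(f"chance of a result this lopsided if the models were equal: {p_value:.5f}")
```

A small value from the second helper suggests the 38 wins are unlikely to be luck alone, though it says nothing about whether the reviewer's ratings measure what you care about.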