How-to guides: Evaluation
This section contains how-to guides related to evaluation.
📄️ Evaluate an LLM Application
Before diving into this content, it might be helpful to read the following:
📄️ Bind an evaluator to a dataset in the UI
While you can specify evaluators to grade the results of your experiments programmatically (see this guide for more information), you can also bind evaluators to a dataset in the UI.
📄️ Run an evaluation from the prompt playground
While you can kick off experiments easily using the SDK, as outlined here, it's often useful to run experiments directly in the prompt playground.
📄️ Evaluate on intermediate steps
While evaluating the final output of your task is sufficient in many scenarios, in some cases you might want to evaluate the intermediate steps of your pipeline.
📄️ Use LangChain off-the-shelf evaluators (Python only)
Before diving into this content, it might be helpful to read the following:
📄️ Compare experiment results
When you are iterating on your LLM application (for example, changing the model or the prompt), you will often want to compare the results of different experiments.
📄️ Evaluate an existing experiment
Currently, evaluate_existing is only supported in the Python SDK.
📄️ Test LLM applications (Python only)
LangSmith functional tests are assertions and expectations designed to quickly identify obvious bugs and regressions in your AI system. Relative to evaluations, tests are typically designed to be fast and cheap to run, focusing on specific functionality and edge cases.
📄️ Run pairwise evaluations
Before diving into this content, it might be helpful to read the following:
📄️ Audit evaluator scores
LLM-as-a-judge evaluators don't always get it right. Because of this, it is often useful for a human to manually audit the scores left by an evaluator and correct them where necessary. LangSmith allows you to make corrections on evaluator scores in the UI or SDK.
📄️ Create few-shot evaluators
Using LLM-as-a-judge evaluators can be very helpful when you can't evaluate your system programmatically. However, improving/iterating on these prompts can add unnecessary
📄️ Fetch performance metrics for an experiment
Tracing projects and experiments use the same underlying data structure in our backend, which is called a "session."