Evaluation
langsmith.evaluation ¶
Evaluation helper functions.
| Function | Description |
|---|---|
| aevaluate | Evaluate an async target system on a given dataset. |
| aevaluate_existing | Evaluate existing experiment runs asynchronously. |
| evaluate | Evaluate a target system on a given dataset. |
| evaluate_comparative | Evaluate existing experiment runs against each other. |
| evaluate_existing | Evaluate existing experiment runs. |
| run_evaluator | Create a run evaluator from a function. |
EvaluationResult ¶
Bases: BaseModel
Evaluation result.
| Method | Description |
|---|---|
| check_value_non_numeric | Check that the value is not numeric. |
evaluator_info class-attribute instance-attribute ¶
Additional information about the evaluator.
feedback_config class-attribute instance-attribute ¶
feedback_config: FeedbackConfig | dict | None = None
The configuration used to generate this feedback.
source_run_id class-attribute instance-attribute ¶
The ID of the trace of the evaluator itself.
target_run_id class-attribute instance-attribute ¶
The ID of the trace this evaluation is applied to.
If not provided, the evaluation feedback is applied to the root trace.
Config ¶
Pydantic model configuration.
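For orientation, here is a minimal sketch of constructing an EvaluationResult by hand. It assumes the model's usual `key`, `score`, and `comment` fields, which are not listed above, so treat those field names as illustrative:
>>> from langsmith.evaluation import EvaluationResult
>>> result = EvaluationResult(
...     key="correctness",  # feedback key (assumed field)
...     score=1.0,  # numeric score (assumed field)
...     comment="Matched the reference answer.",
... )
>>> result.key
'correctness'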
RunEvaluator ¶
Evaluator interface class.
| Method | Description |
|---|---|
| evaluate_run | Evaluate an example. |
| aevaluate_run | Evaluate an example asynchronously. |
evaluate_run abstractmethod ¶
evaluate_run(
run: Run, example: Example | None = None, evaluator_run_id: UUID | None = None
) -> EvaluationResult | EvaluationResults
Evaluate an example.
aevaluate_run async ¶
aevaluate_run(
run: Run, example: Example | None = None, evaluator_run_id: UUID | None = None
) -> EvaluationResult | EvaluationResults
Evaluate an example asynchronously.
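To make the interface concrete, the following is a hedged sketch of a custom evaluator written against the `evaluate_run` signature shown above; the exact-match logic and the "output"/"answer" field names are illustrative assumptions, not part of the API:
>>> from typing import Optional
>>> from langsmith.evaluation import EvaluationResult, RunEvaluator
>>> from langsmith.schemas import Example, Run
>>> class ExactMatchEvaluator(RunEvaluator):
...     def evaluate_run(
...         self, run: Run, example: Optional[Example] = None, **kwargs
...     ) -> EvaluationResult:
...         # Compare the run's "output" to the example's "answer" (illustrative keys).
...         pred = (run.outputs or {}).get("output", "")
...         expected = (example.outputs or {}).get("answer", "") if example else ""
...         return EvaluationResult(key="exact_match", score=pred == expected)
Instances of such a class can be passed in the `evaluators` list of `evaluate` / `aevaluate`; for simpler cases, the `run_evaluator` helper listed above can wrap a plain function instead.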
LangChainStringEvaluator ¶
A class for wrapping a LangChain StringEvaluator.
Requires the langchain package to be installed.
| Attribute | Description |
|---|---|
| evaluator | The underlying StringEvaluator, or the name of the evaluator to load. |
| Method | Description |
|---|---|
| as_run_evaluator | Convert the LangChainStringEvaluator to a RunEvaluator. |
Examples
Create a simple LangChainStringEvaluator
Convert the LangChainStringEvaluator to a RunEvaluator
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> evaluator = LangChainStringEvaluator(
... "criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if"
... " it is correct and/or asks a useful followup question."
... },
... "llm": ChatOpenAI(model="gpt-4o"),
... },
... )
>>> run_evaluator = evaluator.as_run_evaluator()
>>> run_evaluator
<DynamicRunEvaluator ...>
Customize the LLM model used by the evaluator
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_anthropic import ChatAnthropic
>>> evaluator = LangChainStringEvaluator(
... "criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if"
... " it is correct and/or asks a useful followup question."
... },
... "llm": ChatAnthropic(model="claude-3-opus-20240229"),
... },
... )
>>> run_evaluator = evaluator.as_run_evaluator()
>>> run_evaluator
<DynamicRunEvaluator ...>
Use the `evaluate` API with different evaluators
>>> def prepare_data(run: Run, example: Example):
... # Convert the evaluation data into the format expected by the evaluator
... # Only required for datasets with multiple inputs/output keys
... return {
... "prediction": run.outputs["prediction"],
... "reference": example.outputs["answer"],
... "input": str(example.inputs),
... }
>>> import re
>>> from langchain_anthropic import ChatAnthropic
>>> import langsmith
>>> from langsmith.evaluation import LangChainStringEvaluator, evaluate
>>> criteria_evaluator = LangChainStringEvaluator(
... "criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if it is correct"
... " and/or asks a useful followup question."
... },
... "llm": ChatAnthropic(model="claude-3-opus-20240229"),
... },
... prepare_data=prepare_data,
... )
>>> embedding_evaluator = LangChainStringEvaluator("embedding_distance")
>>> exact_match_evaluator = LangChainStringEvaluator("exact_match")
>>> regex_match_evaluator = LangChainStringEvaluator(
... "regex_match", config={"flags": re.IGNORECASE}, prepare_data=prepare_data
... )
>>> scoring_evaluator = LangChainStringEvaluator(
... "labeled_score_string",
... config={
... "criteria": {
... "accuracy": "Score 1: Completely inaccurate\nScore 5: Somewhat accurate\nScore 10: Completely accurate"
... },
... "normalize_by": 10,
... "llm": ChatAnthropic(model="claude-3-opus-20240229"),
... },
... prepare_data=prepare_data,
... )
>>> string_distance_evaluator = LangChainStringEvaluator(
... "string_distance",
... config={"distance_metric": "levenshtein"},
... prepare_data=prepare_data,
... )
>>> from langsmith import Client
>>> client = Client()
>>> results = evaluate(
... lambda inputs: {"prediction": "foo"},
... data=client.list_examples(dataset_name="Evaluate Examples", limit=1),
... evaluators=[
... embedding_evaluator,
... criteria_evaluator,
... exact_match_evaluator,
... regex_match_evaluator,
... scoring_evaluator,
... string_distance_evaluator,
... ],
... )
View the evaluation results for experiment:...
__init__ ¶
__init__(
evaluator: StringEvaluator | str,
*,
config: dict | None = None,
prepare_data: Callable[[Run, Optional[Example]], SingleEvaluatorInput]
| None = None,
)
Initialize a LangChainStringEvaluator.
| Parameter | Description |
|---|---|
| evaluator | The underlying StringEvaluator, or the name of the evaluator to load. |
as_run_evaluator ¶
as_run_evaluator() -> RunEvaluator
Convert the LangChainStringEvaluator to a RunEvaluator.
This is the object used in the LangSmith `evaluate` API.
| Returns | Description |
|---|---|
| RunEvaluator | The converted RunEvaluator. |
aevaluate async ¶
aevaluate(
target: ATARGET_T | AsyncIterable[dict] | Runnable | str | UUID | TracerSession,
/,
data: DATA_T | AsyncIterable[Example] | Iterable[Example] | None = None,
evaluators: Sequence[EVALUATOR_T | AEVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
experiment_prefix: str | None = None,
description: str | None = None,
max_concurrency: int | None = 0,
num_repetitions: int = 1,
client: Client | None = None,
blocking: bool = True,
experiment: TracerSession | str | UUID | None = None,
upload_results: bool = True,
error_handling: Literal["log", "ignore"] = "log",
**kwargs: Any,
) -> AsyncExperimentResults
Evaluate an async target system on a given dataset.
| Parameter | Description |
|---|---|
| target | The target system or experiment to evaluate. Can be an async function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs. |
| data | The dataset to evaluate on. Can be a dataset name, a list of examples, an async generator of examples, or an async iterable of examples. |
| evaluators | A list of evaluators to run on each example. Defaults to None. |
| summary_evaluators | A list of summary evaluators to run on the entire dataset. Defaults to None. |
| metadata | Metadata to attach to the experiment. Defaults to None. |
| experiment_prefix | A prefix to provide for your experiment name. Defaults to None. |
| description | A description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| num_repetitions | The number of times to run the evaluation. Each item in the dataset is run and evaluated this many times. Defaults to 1. |
| client | The LangSmith client to use. Defaults to None. |
| blocking | Whether to block until the evaluation is complete. Defaults to True. |
| experiment | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. Should only be specified when evaluating an existing experiment. |
| error_handling | How to handle individual run errors. 'log' traces the run, with its error message, as part of the experiment; 'ignore' does not count the run toward the experiment at all. |
| Returns | Description |
|---|---|
| AsyncExperimentResults | AsyncIterator[ExperimentResultRow]: An async iterator over the experiment results. |
Environment
- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Committing the cache files to your repository is recommended for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed. A minimal example of setting it is sketched below.
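As a hedged sketch (the directory path here is purely illustrative), the cache location can be configured in-process before running your evaluation tests:
>>> import os
>>> os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"  # illustrative cache directory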
Examples
>>> from typing import Sequence
>>> from langsmith import Client, aevaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
>>> import asyncio
>>> async def apredict(inputs: dict) -> dict:
... # This can be any async function or just an API call to your app.
... await asyncio.sleep(0.1)
... return {"output": "Yes"}
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Evaluate the accuracy of the model asynchronously.",
... metadata={
... "my-prompt-version": "abcd-1234",
... },
... )
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples using an async generator
>>> async def example_generator():
... examples = client.list_examples(dataset_name=dataset_name, limit=5)
... for example in examples:
... yield example
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=example_generator(),
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Subset Experiment",
... description="Evaluate a subset of examples asynchronously.",
... )
... )
View the evaluation results for experiment:...
Streaming each prediction to debug more easily and eagerly.
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Streaming Experiment",
... description="Streaming predictions for debugging.",
... blocking=False,
... )
... )
View the evaluation results for experiment:...
>>> async def aenumerate(iterable):
... async for elem in iterable:
... print(elem)
>>> asyncio.run(aenumerate(results))
Running without concurrency
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment Without Concurrency",
... description="This was run without concurrency.",
... max_concurrency=0,
... )
... )
View the evaluation results for experiment:...
Using an async evaluator
>>> async def helpfulness(run: Run, example: Example):
... # Row-level evaluator for helpfulness.
... await asyncio.sleep(5) # Replace with your LLM API call
... return {"score": run.outputs["output"] == "Yes"}
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[helpfulness],
... summary_evaluators=[precision],
... experiment_prefix="My Helpful Experiment",
... description="Applying async evaluators example.",
... )
... )
View the evaluation results for experiment:...
.. versionchanged:: 0.2.0
'max_concurrency' default updated from None (no limit on concurrency)
to 0 (no concurrency at all).
aevaluate_existing async ¶
aevaluate_existing(
experiment: str | UUID | TracerSession,
/,
evaluators: Sequence[EVALUATOR_T | AEVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
max_concurrency: int | None = 0,
client: Client | None = None,
load_nested: bool = False,
blocking: bool = True,
) -> AsyncExperimentResults
Evaluate existing experiment runs asynchronously.
| Parameter | Description |
|---|---|
| experiment | The identifier of the experiment to evaluate. |
| evaluators | Optional sequence of evaluators to use for individual run evaluation. |
| summary_evaluators | Optional sequence of evaluators to apply over the entire dataset. |
| metadata | Optional metadata to include in the evaluation results. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| client | Optional LangSmith client to use for evaluation. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. |
| blocking | Whether to block until the evaluation is complete. |
| Returns | Description |
|---|---|
| AsyncExperimentResults | AsyncIterator[ExperimentResultRow]: An async iterator over the experiment results. |
Examples
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import asyncio
>>> import uuid
>>> from langsmith import Client, aevaluate, aevaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_aevaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
... inputs={"question": "What is 2+2?"},
... outputs={"answer": "4"},
... dataset_id=dataset.id,
... )
>>> async def apredict(inputs: dict) -> dict:
... await asyncio.sleep(0.001)
... return {"output": "4"}
>>> results = asyncio.run(
... aevaluate(
... apredict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Consume all results to ensure evaluation is complete
>>> async def consume_results():
... result_list = [r async for r in results]
... return len(result_list) > 0
>>> asyncio.run(consume_results())
True
>>> import time
>>> time.sleep(3)
>>> results = asyncio.run(
... aevaluate_existing(
... experiment_id,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... )
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)
evaluate ¶
evaluate(
target: TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T],
/,
data: DATA_T | None = None,
evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
experiment_prefix: str | None = None,
description: str | None = None,
max_concurrency: int | None = 0,
num_repetitions: int = 1,
client: Client | None = None,
blocking: bool = True,
experiment: EXPERIMENT_T | None = None,
upload_results: bool = True,
error_handling: Literal["log", "ignore"] = "log",
**kwargs: Any,
) -> ExperimentResults | ComparativeExperimentResults
Evaluate a target system on a given dataset.
| Parameter | Description |
|---|---|
| target | The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs. |
| data | The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. |
| evaluators | A list of evaluators to run on each example. The evaluator signature depends on the target type. Defaults to None. |
| summary_evaluators | A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. Defaults to None. |
| metadata | Metadata to attach to the experiment. Defaults to None. |
| experiment_prefix | A prefix to provide for your experiment name. Defaults to None. |
| description | A free-form text description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| client | The LangSmith client to use. Defaults to None. |
| blocking | Whether to block until the evaluation is complete. Defaults to True. |
| num_repetitions | The number of times to run the evaluation. Each item in the dataset is run and evaluated this many times. Defaults to 1. |
| experiment | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if the target is an existing experiment or a two-tuple of experiments. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. Should only be specified when the target is an existing experiment or a two-tuple of experiments. |
| randomize_order | Whether to randomize the order of the outputs for each evaluation. Defaults to False. Should only be specified when the target is a two-tuple of existing experiments. |
| error_handling | How to handle individual run errors. 'log' traces the run, with its error message, as part of the experiment; 'ignore' does not count the run toward the experiment at all. |
| Returns | Description |
|---|---|
| ExperimentResults | If the target is a function, Runnable, or an existing experiment. |
| ComparativeExperimentResults | If the target is a two-tuple of existing experiments. |
Examples
Prepare the dataset
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
... # This can be any function or just an API call to your app.
... return {"output": "Yes"}
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Evaluating the accuracy of a simple prediction model.",
... metadata={
... "my-prompt-version": "abcd-1234",
... },
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples
>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
... predict,
... data=examples,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Just testing a subset synchronously.",
... )
View the evaluation results for experiment:...
Streaming each prediction to debug more easily and eagerly.
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... description="I don't even have to block!",
... blocking=False,
... )
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):
... pass
Using the `evaluate` API with off-the-shelf LangChain evaluators
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
... return {
... "prediction": run.outputs["output"],
... "reference": example.outputs["answer"],
... "input": str(example.inputs),
... }
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[
... accuracy,
... LangChainStringEvaluator("embedding_distance"),
... LangChainStringEvaluator(
... "labeled_criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if it is correct"
... " and/or asks a useful followup question."
... },
... "llm": ChatOpenAI(model="gpt-4o"),
... },
... prepare_data=prepare_criteria_data,
... ),
... ],
... description="Evaluating with off-the-shelf LangChain evaluators.",
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
Evaluating a LangChain object
>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
... return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
... return nested_predict.invoke(inputs)
>>> results = evaluate(
... lc_predict.invoke,
... data=dataset_name,
... evaluators=[accuracy],
... description="This time we're evaluating a LangChain object.",
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
.. versionchanged:: 0.2.0
'max_concurrency' default updated from None (no limit on concurrency)
to 0 (no concurrency at all).
evaluate_comparative ¶
evaluate_comparative(
experiments: tuple[EXPERIMENT_T, EXPERIMENT_T],
/,
evaluators: Sequence[COMPARATIVE_EVALUATOR_T],
experiment_prefix: str | None = None,
description: str | None = None,
max_concurrency: int = 5,
client: Client | None = None,
metadata: dict | None = None,
load_nested: bool = False,
randomize_order: bool = False,
) -> ComparativeExperimentResults
Evaluate existing experiment runs against each other.
This lets you use pairwise preference scoring to generate more reliable feedback in your experiments.
| Parameter | Description |
|---|---|
| experiments | The identifiers of the experiments to compare. |
| evaluators | A list of evaluators to run on each example. |
| experiment_prefix | A prefix to provide for your experiment name. Defaults to None. |
| description | A free-form text description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. Defaults to 5. |
| client | The LangSmith client to use. Defaults to None. |
| metadata | Metadata to attach to the experiment. Defaults to None. |
| load_nested | Whether to load all child runs for the experiments. By default, only the top-level root runs are loaded. |
| randomize_order | Whether to randomize the order of the outputs for each evaluation. Defaults to False. |
| Returns | Description |
|---|---|
| ComparativeExperimentResults | The results of the comparative evaluation. |
Examples
Suppose you want to compare two prompts to see which one is more effective. You would first prepare your dataset:
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Then you would run your different prompts:
>>> import functools
>>> import openai
>>> from langsmith.evaluation import evaluate
>>> from langsmith.wrappers import wrap_openai
>>> oai_client = openai.Client()
>>> wrapped_client = wrap_openai(oai_client)
>>> prompt_1 = "You are a helpful assistant."
>>> prompt_2 = "You are an exceedingly helpful assistant."
>>> def predict(inputs: dict, prompt: str) -> dict:
... completion = wrapped_client.chat.completions.create(
... model="gpt-4o-mini",
... messages=[
... {"role": "system", "content": prompt},
... {
... "role": "user",
... "content": f"Context: {inputs['context']}"
... f"\n\ninputs['question']",
... },
... ],
... )
... return {"output": completion.choices[0].message.content}
>>> results_1 = evaluate(
... functools.partial(predict, prompt=prompt_1),
... data=dataset_name,
... description="Evaluating our basic system prompt.",
... blocking=False, # Run these experiments in parallel
... )
View the evaluation results for experiment:...
>>> results_2 = evaluate(
... functools.partial(predict, prompt=prompt_2),
... data=dataset_name,
... description="Evaluating our advanced system prompt.",
... blocking=False,
... )
View the evaluation results for experiment:...
>>> results_1.wait()
>>> results_2.wait()
Finally, you would compare the two prompts directly:
>>> import json
>>> from langsmith.evaluation import evaluate_comparative
>>> from langsmith import schemas
>>> def score_preferences(runs: list, example: schemas.Example):
... assert len(runs) == 2 # Comparing 2 systems
... assert isinstance(example, schemas.Example)
... assert all(run.reference_example_id == example.id for run in runs)
... pred_a = runs[0].outputs["output"] if runs[0].outputs else ""
... pred_b = runs[1].outputs["output"] if runs[1].outputs else ""
... ground_truth = example.outputs["answer"] if example.outputs else ""
... tools = [
... {
... "type": "function",
... "function": {
... "name": "rank_preferences",
... "description": "Saves the prefered response ('A' or 'B')",
... "parameters": {
... "type": "object",
... "properties": {
... "reasoning": {
... "type": "string",
... "description": "The reasoning behind the choice.",
... },
... "preferred_option": {
... "type": "string",
... "enum": ["A", "B"],
... "description": "The preferred option, either 'A' or 'B'",
... },
... },
... "required": ["preferred_option"],
... },
... },
... }
... ]
... completion = openai.Client().chat.completions.create(
... model="gpt-4o-mini",
... messages=[
... {"role": "system", "content": "Select the better response."},
... {
... "role": "user",
... "content": f"Option A: {pred_a}"
... f"\n\nOption B: {pred_b}"
... f"\n\nGround Truth: {ground_truth}",
... },
... ],
... tools=tools,
... tool_choice={
... "type": "function",
... "function": {"name": "rank_preferences"},
... },
... )
... tool_args = completion.choices[0].message.tool_calls[0].function.arguments
... loaded_args = json.loads(tool_args)
... preference = loaded_args["preferred_option"]
... comment = loaded_args["reasoning"]
... if preference == "A":
... return {
... "key": "ranked_preference",
... "scores": {runs[0].id: 1, runs[1].id: 0},
... "comment": comment,
... }
... else:
... return {
... "key": "ranked_preference",
... "scores": {runs[0].id: 0, runs[1].id: 1},
... "comment": comment,
... }
>>> def score_length_difference(runs: list, example: schemas.Example):
... # Just return whichever response is longer.
... # Just an example, not actually useful in real life.
... assert len(runs) == 2 # Comparing 2 systems
... assert isinstance(example, schemas.Example)
... assert all(run.reference_example_id == example.id for run in runs)
... pred_a = runs[0].outputs["output"] if runs[0].outputs else ""
... pred_b = runs[1].outputs["output"] if runs[1].outputs else ""
... if len(pred_a) > len(pred_b):
... return {
... "key": "length_difference",
... "scores": {runs[0].id: 1, runs[1].id: 0},
... }
... else:
... return {
... "key": "length_difference",
... "scores": {runs[0].id: 0, runs[1].id: 1},
... }
>>> results = evaluate_comparative(
... [results_1.experiment_name, results_2.experiment_name],
... evaluators=[score_preferences, score_length_difference],
... client=client,
... )
View the pairwise evaluation results at:...
>>> eval_results = list(results)
>>> assert len(eval_results) >= 10
>>> assert all(
... "feedback.ranked_preference" in r["evaluation_results"]
... for r in eval_results
... )
>>> assert all(
... "feedback.length_difference" in r["evaluation_results"]
... for r in eval_results
... )
evaluate_existing ¶
evaluate_existing(
experiment: str | UUID | TracerSession,
/,
evaluators: Sequence[EVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
max_concurrency: int | None = 0,
client: Client | None = None,
load_nested: bool = False,
blocking: bool = True,
) -> ExperimentResults
Evaluate existing experiment runs.
| Parameter | Description |
|---|---|
| experiment | The identifier of the experiment to evaluate. |
| data | The data to use for the evaluation. |
| evaluators | Optional sequence of evaluators to use for individual run evaluation. |
| summary_evaluators | Optional sequence of evaluators to apply over the entire dataset. |
| metadata | Optional metadata to include in the evaluation results. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| client | Optional LangSmith client to use for evaluation. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. |
| blocking | Whether to block until the evaluation is complete. |
| Returns | Description |
|---|---|
| ExperimentResults | The evaluation results. |
Environment
- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Committing the cache files to your repository is recommended for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed.
Examples
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import uuid
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate, evaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_evaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
... inputs={"question": "What is 2+2?"},
... outputs={"answer": "4"},
... dataset_id=dataset.id,
... )
>>> def predict(inputs: dict) -> dict:
... return {"output": "4"}
>>> # First run inference on the dataset
... results = evaluate(
... predict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Wait for the experiment to be fully processed and check if we have results
>>> len(results) > 0
True
>>> import time
>>> time.sleep(2)
>>> results = evaluate_existing(
... experiment_id,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)