Evaluation
langsmith.evaluation ¶
Evaluation helper functions.
| Function | Description |
|---|---|
| aevaluate | Evaluate an async target system on a given dataset. |
| aevaluate_existing | Evaluate existing experiment runs asynchronously. |
| evaluate | Evaluate a target system on a given dataset. |
| evaluate_comparative | Evaluate existing experiment runs against each other. |
| evaluate_existing | Evaluate existing experiment runs. |
| run_evaluator | Create a run evaluator from a function. |
EvaluationResult ¶
Bases: BaseModel
Evaluation result.
| Method | Description |
|---|---|
| check_value_non_numeric | Check that the value is not numeric. |
evaluator_info class-attribute instance-attribute ¶
Additional information about the evaluator.
feedback_config class-attribute instance-attribute ¶
feedback_config: FeedbackConfig | dict | None = None
The configuration used to generate this feedback.
source_run_id class-attribute instance-attribute ¶
The ID of the trace of the evaluator itself.
target_run_id class-attribute instance-attribute ¶
The ID of the trace this evaluation is applied to.
If not provided, the evaluation feedback is applied to the root trace.
Config ¶
Pydantic model configuration.
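For orientation, here is a minimal sketch of constructing an EvaluationResult by hand. It assumes the model's usual `key`, `score`, and `comment` fields, which are not listed above, so treat those field names as illustrative:
>>> from langsmith.evaluation import EvaluationResult
>>> result = EvaluationResult(
...     key="correctness",  # feedback key (assumed field)
...     score=1.0,  # numeric score (assumed field)
...     comment="Matched the reference answer.",
... )
>>> result.key
'correctness'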
RunEvaluator ¶
Evaluator interface class.
| Method | Description |
|---|---|
| evaluate_run | Evaluate an example. |
| aevaluate_run | Evaluate an example asynchronously. |
evaluate_run abstractmethod ¶
evaluate_run(
run: Run, example: Example | None = None, evaluator_run_id: UUID | None = None
) -> EvaluationResult | EvaluationResults
Evaluate an example.
aevaluate_run async ¶
aevaluate_run(
run: Run, example: Example | None = None, evaluator_run_id: UUID | None = None
) -> EvaluationResult | EvaluationResults
Evaluate an example asynchronously.
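To make the interface concrete, the following is a hedged sketch of a custom evaluator written against the `evaluate_run` signature shown above; the exact-match logic and the "output"/"answer" field names are illustrative assumptions, not part of the API:
>>> from typing import Optional
>>> from langsmith.evaluation import EvaluationResult, RunEvaluator
>>> from langsmith.schemas import Example, Run
>>> class ExactMatchEvaluator(RunEvaluator):
...     def evaluate_run(
...         self, run: Run, example: Optional[Example] = None, **kwargs
...     ) -> EvaluationResult:
...         # Compare the run's "output" to the example's "answer" (illustrative keys).
...         pred = (run.outputs or {}).get("output", "")
...         expected = (example.outputs or {}).get("answer", "") if example else ""
...         return EvaluationResult(key="exact_match", score=pred == expected)
Instances of such a class can be passed in the `evaluators` list of `evaluate` / `aevaluate`; for simpler cases, the `run_evaluator` helper listed above can wrap a plain function instead.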
LangChainStringEvaluator ¶
A class for wrapping a LangChain StringEvaluator.
Requires the langchain package to be installed.
| Attribute | Description |
|---|---|
| evaluator | The underlying StringEvaluator, or the name of the evaluator to load. |
| Method | Description |
|---|---|
| as_run_evaluator | Convert the LangChainStringEvaluator to a RunEvaluator. |
Examples
Create a simple LangChainStringEvaluator
Convert the LangChainStringEvaluator to a RunEvaluator
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> evaluator = LangChainStringEvaluator(
... "criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if"
... " it is correct and/or asks a useful followup question."
... },
... "llm": ChatOpenAI(model="gpt-4o"),
... },
... )
>>> run_evaluator = evaluator.as_run_evaluator()
>>> run_evaluator
<DynamicRunEvaluator ...>
Customize the LLM model used by the evaluator
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_anthropic import ChatAnthropic
>>> evaluator = LangChainStringEvaluator(
... "criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if"
... " it is correct and/or asks a useful followup question."
... },
... "llm": ChatAnthropic(model="claude-3-opus-20240229"),
... },
... )
>>> run_evaluator = evaluator.as_run_evaluator()
>>> run_evaluator
<DynamicRunEvaluator ...>
Use the `evaluate` API with different evaluators
>>> def prepare_data(run: Run, example: Example):
... # Convert the evaluation data into the format expected by the evaluator
... # Only required for datasets with multiple inputs/output keys
... return {
... "prediction": run.outputs["prediction"],
... "reference": example.outputs["answer"],
... "input": str(example.inputs),
... }
>>> import re
>>> from langchain_anthropic import ChatAnthropic
>>> import langsmith
>>> from langsmith.evaluation import LangChainStringEvaluator, evaluate
>>> criteria_evaluator = LangChainStringEvaluator(
... "criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if it is correct"
... " and/or asks a useful followup question."
... },
... "llm": ChatAnthropic(model="claude-3-opus-20240229"),
... },
... prepare_data=prepare_data,
... )
>>> embedding_evaluator = LangChainStringEvaluator("embedding_distance")
>>> exact_match_evaluator = LangChainStringEvaluator("exact_match")
>>> regex_match_evaluator = LangChainStringEvaluator(
... "regex_match", config={"flags": re.IGNORECASE}, prepare_data=prepare_data
... )
>>> scoring_evaluator = LangChainStringEvaluator(
... "labeled_score_string",
... config={
... "criteria": {
... "accuracy": "Score 1: Completely inaccurate\nScore 5: Somewhat accurate\nScore 10: Completely accurate"
... },
... "normalize_by": 10,
... "llm": ChatAnthropic(model="claude-3-opus-20240229"),
... },
... prepare_data=prepare_data,
... )
>>> string_distance_evaluator = LangChainStringEvaluator(
... "string_distance",
... config={"distance_metric": "levenshtein"},
... prepare_data=prepare_data,
... )
>>> from langsmith import Client
>>> client = Client()
>>> results = evaluate(
... lambda inputs: {"prediction": "foo"},
... data=client.list_examples(dataset_name="Evaluate Examples", limit=1),
... evaluators=[
... embedding_evaluator,
... criteria_evaluator,
... exact_match_evaluator,
... regex_match_evaluator,
... scoring_evaluator,
... string_distance_evaluator,
... ],
... )
View the evaluation results for experiment:...
__init__ ¶
__init__(
evaluator: StringEvaluator | str,
*,
config: dict | None = None,
prepare_data: Callable[[Run, Optional[Example]], SingleEvaluatorInput]
| None = None,
)
Initialize a LangChainStringEvaluator.
| Parameter | Description |
|---|---|
| evaluator | The underlying StringEvaluator, or the name of the evaluator to load. |
as_run_evaluator ¶
as_run_evaluator() -> RunEvaluator
Convert the LangChainStringEvaluator to a RunEvaluator.
This is the object used in the LangSmith `evaluate` API.
| Returns | Description |
|---|---|
| RunEvaluator | The converted RunEvaluator. |
aevaluate async ¶
aevaluate(
target: ATARGET_T | AsyncIterable[dict] | Runnable | str | UUID | TracerSession,
/,
data: DATA_T | AsyncIterable[Example] | Iterable[Example] | None = None,
evaluators: Sequence[EVALUATOR_T | AEVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
experiment_prefix: str | None = None,
description: str | None = None,
max_concurrency: int | None = 0,
num_repetitions: int = 1,
client: Client | None = None,
blocking: bool = True,
experiment: TracerSession | str | UUID | None = None,
upload_results: bool = True,
error_handling: Literal["log", "ignore"] = "log",
**kwargs: Any,
) -> AsyncExperimentResults
Evaluate an async target system on a given dataset.
| Parameter | Description |
|---|---|
| target | The target system or experiment to evaluate. Can be an async function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs. |
| data | The dataset to evaluate on. Can be a dataset name, a list of examples, an async generator of examples, or an async iterable of examples. |
| evaluators | A list of evaluators to run on each example. Defaults to None. |
| summary_evaluators | A list of summary evaluators to run on the entire dataset. Defaults to None. |
| metadata | Metadata to attach to the experiment. Defaults to None. |
| experiment_prefix | A prefix to provide for your experiment name. Defaults to None. |
| description | A description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| num_repetitions | The number of times to run the evaluation. Each item in the dataset is run and evaluated this many times. Defaults to 1. |
| client | The LangSmith client to use. Defaults to None. |
| blocking | Whether to block until the evaluation is complete. Defaults to True. |
| experiment | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. Should only be specified when evaluating an existing experiment. |
| error_handling | How to handle individual run errors. 'log' traces the run, with its error message, as part of the experiment; 'ignore' does not count the run toward the experiment at all. |
| Returns | Description |
|---|---|
| AsyncExperimentResults | AsyncIterator[ExperimentResultRow]: An async iterator over the experiment results. |
Environment
- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Committing the cache files to your repository is recommended for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed. A minimal example of setting it is sketched below.
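As a hedged sketch (the directory path here is purely illustrative), the cache location can be configured in-process before running your evaluation tests:
>>> import os
>>> os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"  # illustrative cache directory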
Examples
>>> from typing import Sequence
>>> from langsmith import Client, aevaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
>>> import asyncio
>>> async def apredict(inputs: dict) -> dict:
... # This can be any async function or just an API call to your app.
... await asyncio.sleep(0.1)
... return {"output": "Yes"}
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Evaluate the accuracy of the model asynchronously.",
... metadata={
... "my-prompt-version": "abcd-1234",
... },
... )
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples using an async generator
>>> async def example_generator():
... examples = client.list_examples(dataset_name=dataset_name, limit=5)
... for example in examples:
... yield example
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=example_generator(),
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Subset Experiment",
... description="Evaluate a subset of examples asynchronously.",
... )
... )
View the evaluation results for experiment:...
Streaming each prediction to debug more easily and eagerly.
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Streaming Experiment",
... description="Streaming predictions for debugging.",
... blocking=False,
... )
... )
View the evaluation results for experiment:...
>>> async def aenumerate(iterable):
... async for elem in iterable:
... print(elem)
>>> asyncio.run(aenumerate(results))
Running without concurrency
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment Without Concurrency",
... description="This was run without concurrency.",
... max_concurrency=0,
... )
... )
View the evaluation results for experiment:...
Using an async evaluator
>>> async def helpfulness(run: Run, example: Example):
... # Row-level evaluator for helpfulness.
... await asyncio.sleep(5) # Replace with your LLM API call
... return {"score": run.outputs["output"] == "Yes"}
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[helpfulness],
... summary_evaluators=[precision],
... experiment_prefix="My Helpful Experiment",
... description="Applying async evaluators example.",
... )
... )
View the evaluation results for experiment:...
.. versionchanged:: 0.2.0
'max_concurrency' default updated from None (no limit on concurrency)
to 0 (no concurrency at all).
aevaluate_existing async ¶
aevaluate_existing(
experiment: str | UUID | TracerSession,
/,
evaluators: Sequence[EVALUATOR_T | AEVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
max_concurrency: int | None = 0,
client: Client | None = None,
load_nested: bool = False,
blocking: bool = True,
) -> AsyncExperimentResults
Evaluate existing experiment runs asynchronously.
| Parameter | Description |
|---|---|
| experiment | The identifier of the experiment to evaluate. |
| evaluators | Optional sequence of evaluators to use for individual run evaluation. |
| summary_evaluators | Optional sequence of evaluators to apply over the entire dataset. |
| metadata | Optional metadata to include in the evaluation results. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| client | Optional LangSmith client to use for evaluation. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. |
| blocking | Whether to block until the evaluation is complete. |
| Returns | Description |
|---|---|
| AsyncExperimentResults | AsyncIterator[ExperimentResultRow]: An async iterator over the experiment results. |
Examples
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import asyncio
>>> import uuid
>>> from langsmith import Client, aevaluate, aevaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_aevaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
... inputs={"question": "What is 2+2?"},
... outputs={"answer": "4"},
... dataset_id=dataset.id,
... )
>>> async def apredict(inputs: dict) -> dict:
... await asyncio.sleep(0.001)
... return {"output": "4"}
>>> results = asyncio.run(
... aevaluate(
... apredict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Consume all results to ensure evaluation is complete
>>> async def consume_results():
... result_list = [r async for r in results]
... return len(result_list) > 0
>>> asyncio.run(consume_results())
True
>>> import time
>>> time.sleep(3)
>>> results = asyncio.run(
... aevaluate_existing(
... experiment_id,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... )
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)
evaluate ¶
evaluate(
target: TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T],
/,
data: DATA_T | None = None,
evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
experiment_prefix: str | None = None,
description: str | None = None,
max_concurrency: int | None = 0,
num_repetitions: int = 1,
client: Client | None = None,
blocking: bool = True,
experiment: EXPERIMENT_T | None = None,
upload_results: bool = True,
error_handling: Literal["log", "ignore"] = "log",
**kwargs: Any,
) -> ExperimentResults | ComparativeExperimentResults
Evaluate a target system on a given dataset.
| Parameter | Description |
|---|---|
| target | The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs. |
| data | The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. |
| evaluators | A list of evaluators to run on each example. The evaluator signature depends on the target type. Defaults to None. |
| summary_evaluators | A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. Defaults to None. |
| metadata | Metadata to attach to the experiment. Defaults to None. |
| experiment_prefix | A prefix to provide for your experiment name. Defaults to None. |
| description | A free-form text description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| client | The LangSmith client to use. Defaults to None. |
| blocking | Whether to block until the evaluation is complete. Defaults to True. |
| num_repetitions | The number of times to run the evaluation. Each item in the dataset is run and evaluated this many times. Defaults to 1. |
| experiment | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if the target is an existing experiment or a two-tuple of experiments. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. Should only be specified when the target is an existing experiment or a two-tuple of experiments. |
| randomize_order | Whether to randomize the order of the outputs for each evaluation. Defaults to False. Should only be specified when the target is a two-tuple of existing experiments. |
| error_handling | How to handle individual run errors. 'log' traces the run, with its error message, as part of the experiment; 'ignore' does not count the run toward the experiment at all. |
| Returns | Description |
|---|---|
| ExperimentResults | If the target is a function, Runnable, or an existing experiment. |
| ComparativeExperimentResults | If the target is a two-tuple of existing experiments. |
Examples
Prepare the dataset
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
... # This can be any function or just an API call to your app.
... return {"output": "Yes"}
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Evaluating the accuracy of a simple prediction model.",
... metadata={
... "my-prompt-version": "abcd-1234",
... },
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples
>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
... predict,
... data=examples,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Just testing a subset synchronously.",
... )
View the evaluation results for experiment:...
Streaming each prediction to debug more easily and eagerly.
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... description="I don't even have to block!",
... blocking=False,
... )
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):
... pass
Using the `evaluate` API with off-the-shelf LangChain evaluators
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
... return {
... "prediction": run.outputs["output"],
... "reference": example.outputs["answer"],
... "input": str(example.inputs),
... }
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[
... accuracy,
... LangChainStringEvaluator("embedding_distance"),
... LangChainStringEvaluator(
... "labeled_criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if it is correct"
... " and/or asks a useful followup question."
... },
... "llm": ChatOpenAI(model="gpt-4o"),
... },
... prepare_data=prepare_criteria_data,
... ),
... ],
... description="Evaluating with off-the-shelf LangChain evaluators.",
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
Evaluating a LangChain object
>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
... return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
... return nested_predict.invoke(inputs)
>>> results = evaluate(
... lc_predict.invoke,
... data=dataset_name,
... evaluators=[accuracy],
... description="This time we're evaluating a LangChain object.",
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
.. versionchanged:: 0.2.0
'max_concurrency' default updated from None (no limit on concurrency)
to 0 (no concurrency at all).
evaluate_comparative ¶
evaluate_comparative(
experiments: tuple[EXPERIMENT_T, EXPERIMENT_T],
/,
evaluators: Sequence[COMPARATIVE_EVALUATOR_T],
experiment_prefix: str | None = None,
description: str | None = None,
max_concurrency: int = 5,
client: Client | None = None,
metadata: dict | None = None,
load_nested: bool = False,
randomize_order: bool = False,
) -> ComparativeExperimentResults
Evaluate existing experiment runs against each other.
This lets you use pairwise preference scoring to generate more reliable feedback in your experiments.
| Parameter | Description |
|---|---|
| experiments | The identifiers of the experiments to compare. |
| evaluators | A list of evaluators to run on each example. |
| experiment_prefix | A prefix to provide for your experiment name. Defaults to None. |
| description | A free-form text description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. Defaults to 5. |
| client | The LangSmith client to use. Defaults to None. |
| metadata | Metadata to attach to the experiment. Defaults to None. |
| load_nested | Whether to load all child runs for the experiments. By default, only the top-level root runs are loaded. |
| randomize_order | Whether to randomize the order of the outputs for each evaluation. Defaults to False. |
| Returns | Description |
|---|---|
| ComparativeExperimentResults | The results of the comparative evaluation. |
Examples
Suppose you want to compare two prompts to see which one is more effective. You would first prepare your dataset:
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Then you would run your different prompts:
>>> import functools
>>> import openai
>>> from langsmith.evaluation import evaluate
>>> from langsmith.wrappers import wrap_openai
>>> oai_client = openai.Client()
>>> wrapped_client = wrap_openai(oai_client)
>>> prompt_1 = "You are a helpful assistant."
>>> prompt_2 = "You are an exceedingly helpful assistant."
>>> def predict(inputs: dict, prompt: str) -> dict:
... completion = wrapped_client.chat.completions.create(
... model="gpt-4o-mini",
... messages=[
... {"role": "system", "content": prompt},
... {
... "role": "user",
... "content": f"Context: {inputs['context']}"
... f"\n\ninputs['question']",
... },
... ],
... )
... return {"output": completion.choices[0].message.content}
>>> results_1 = evaluate(
... functools.partial(predict, prompt=prompt_1),
... data=dataset_name,
... description="Evaluating our basic system prompt.",
... blocking=False, # Run these experiments in parallel
... )
View the evaluation results for experiment:...
>>> results_2 = evaluate(
... functools.partial(predict, prompt=prompt_2),
... data=dataset_name,
... description="Evaluating our advanced system prompt.",
... blocking=False,
... )
View the evaluation results for experiment:...
>>> results_1.wait()
>>> results_2.wait()
Finally, you would compare the two prompts directly:
>>> import json
>>> from langsmith.evaluation import evaluate_comparative
>>> from langsmith import schemas
>>> def score_preferences(runs: list, example: schemas.Example):
... assert len(runs) == 2 # Comparing 2 systems
... assert isinstance(example, schemas.Example)
... assert all(run.reference_example_id == example.id for run in runs)
... pred_a = runs[0].outputs["output"] if runs[0].outputs else ""
... pred_b = runs[1].outputs["output"] if runs[1].outputs else ""
... ground_truth = example.outputs["answer"] if example.outputs else ""
... tools = [
... {
... "type": "function",
... "function": {
... "name": "rank_preferences",
... "description": "Saves the prefered response ('A' or 'B')",
... "parameters": {
... "type": "object",
... "properties": {
... "reasoning": {
... "type": "string",
... "description": "The reasoning behind the choice.",
... },
... "preferred_option": {
... "type": "string",
... "enum": ["A", "B"],
... "description": "The preferred option, either 'A' or 'B'",
... },
... },
... "required": ["preferred_option"],
... },
... },
... }
... ]
... completion = openai.Client().chat.completions.create(
... model="gpt-4o-mini",
... messages=[
... {"role": "system", "content": "Select the better response."},
... {
... "role": "user",
... "content": f"Option A: {pred_a}"
... f"\n\nOption B: {pred_b}"
... f"\n\nGround Truth: {ground_truth}",
... },
... ],
... tools=tools,
... tool_choice={
... "type": "function",
... "function": {"name": "rank_preferences"},
... },
... )
... tool_args = completion.choices[0].message.tool_calls[0].function.arguments
... loaded_args = json.loads(tool_args)
... preference = loaded_args["preferred_option"]
... comment = loaded_args["reasoning"]
... if preference == "A":
... return {
... "key": "ranked_preference",
... "scores": {runs[0].id: 1, runs[1].id: 0},
... "comment": comment,
... }
... else:
... return {
... "key": "ranked_preference",
... "scores": {runs[0].id: 0, runs[1].id: 1},
... "comment": comment,
... }
>>> def score_length_difference(runs: list, example: schemas.Example):
... # Just return whichever response is longer.
... # Just an example, not actually useful in real life.
... assert len(runs) == 2 # Comparing 2 systems
... assert isinstance(example, schemas.Example)
... assert all(run.reference_example_id == example.id for run in runs)
... pred_a = runs[0].outputs["output"] if runs[0].outputs else ""
... pred_b = runs[1].outputs["output"] if runs[1].outputs else ""
... if len(pred_a) > len(pred_b):
... return {
... "key": "length_difference",
... "scores": {runs[0].id: 1, runs[1].id: 0},
... }
... else:
... return {
... "key": "length_difference",
... "scores": {runs[0].id: 0, runs[1].id: 1},
... }
>>> results = evaluate_comparative(
... [results_1.experiment_name, results_2.experiment_name],
... evaluators=[score_preferences, score_length_difference],
... client=client,
... )
View the pairwise evaluation results at:...
>>> eval_results = list(results)
>>> assert len(eval_results) >= 10
>>> assert all(
... "feedback.ranked_preference" in r["evaluation_results"]
... for r in eval_results
... )
>>> assert all(
... "feedback.length_difference" in r["evaluation_results"]
... for r in eval_results
... )
evaluate_existing ¶
evaluate_existing(
experiment: str | UUID | TracerSession,
/,
evaluators: Sequence[EVALUATOR_T] | None = None,
summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
metadata: dict | None = None,
max_concurrency: int | None = 0,
client: Client | None = None,
load_nested: bool = False,
blocking: bool = True,
) -> ExperimentResults
Evaluate existing experiment runs.
| Parameter | Description |
|---|---|
| experiment | The identifier of the experiment to evaluate. |
| data | The data to use for the evaluation. |
| evaluators | Optional sequence of evaluators to use for individual run evaluation. |
| summary_evaluators | Optional sequence of evaluators to apply over the entire dataset. |
| metadata | Optional metadata to include in the evaluation results. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set. If 0, evaluations run with no concurrency. Defaults to 0. |
| client | Optional LangSmith client to use for evaluation. |
| load_nested | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. |
| blocking | Whether to block until the evaluation is complete. |
| Returns | Description |
|---|---|
| ExperimentResults | The evaluation results. |
Environment
- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Committing the cache files to your repository is recommended for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed.
Examples
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import uuid
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate, evaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_evaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
... inputs={"question": "What is 2+2?"},
... outputs={"answer": "4"},
... dataset_id=dataset.id,
... )
>>> def predict(inputs: dict) -> dict:
... return {"output": "4"}
>>> # First run inference on the dataset
... results = evaluate(
... predict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Wait for the experiment to be fully processed and check if we have results
>>> len(results) > 0
True
>>> import time
>>> time.sleep(2)
>>> results = evaluate_existing(
... experiment_id,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)