DeepEval: The Open-Source LLM Evaluation Framework
The LLM Evaluation Framework
Documentation | Metrics and Features | Getting Started | Integrations | DeepEval Platform
DeepEval is a simple-to-use, open-source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, task completion, answer relevancy, hallucination, etc., which uses LLM-as-a-judge and other NLP models that run locally on your machine for evaluation.
Whether your LLM applications are AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipeline, agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own Deepseek R1 with confidence.
[!IMPORTANT] Need a place for your DeepEval testing data to live 🏡❤️? Sign up to the DeepEval platform to compare iterations of your LLM app, generate & share testing reports, and more.
Want to talk LLM evaluation, need help picking metrics, or just to say hi? Come join our discord.
🔥 Metrics and Features
🥳 You can now share DeepEval's test results on the cloud directly on Confident AI
- Supports both end-to-end and component-level LLM evaluation.
- Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine (see the instantiation sketch after this list):
- G-Eval
- DAG (deep acyclic graph)
- RAG metrics:
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- Contextual Relevancy
- RAGAS
- Task Completion
- Hallucination
- Knowledge Retention
- Toxicity
- MMLU
- Curate/annotate evaluation datasets on the cloud
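For a quick feel for how these metrics are used, here is a minimal sketch that instantiates a few of the ready-to-use RAG metrics listed above; it assumes the FaithfulnessMetric and ContextualRecallMetric class names exported by deepeval.metrics, which may shift slightly between versions:
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
)
# Each metric is powered by an LLM judge of your choice and produces a 0-1
# score, a pass/fail verdict against its threshold, and a written reason.
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)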
[!NOTE] DeepEval is available on Confident AI, an LLM evals platform for AI observability and quality. Create an account here.
🔌 Integrations
- 🦄 LlamaIndex, to unit test RAG applications in CI/CD
- 🤗 Hugging Face, to enable real-time evaluations during LLM fine-tuning
🚀 QuickStart
Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.
Installation
DeepEval works with Python 3.9 or newer.
pip install -U deepeval
Create an account (highly recommended)
Using the deepeval platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.
To login, run:
deepeval login
Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).
Writing your first test case
Create a test file:
touch test_chatbot.py
Open test_chatbot.py and write your first test case to run an end-to-end evaluation using DeepEval, which treats your LLM app as a black box:
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])
Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model; see the sketch below, or this part of our docs for more details):
export OPENAI_API_KEY="..."
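If you'd rather not use OpenAI as the judge, metrics can also be powered by your own model. Below is a minimal sketch, assuming deepeval's DeepEvalBaseLLM base class and its load_model / generate / a_generate / get_model_name hooks (the exact interface may differ across versions; the stubbed response stands in for a real call to your hosted model):
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

class MyCustomJudge(DeepEvalBaseLLM):
    """Wraps a locally hosted model so deepeval metrics can use it as the judge."""
    def load_model(self):
        # Return (or lazily construct) your underlying model object here.
        return None
    def generate(self, prompt: str) -> str:
        # Replace this stub with a real call to your model,
        # e.g. an HTTP request to a locally hosted Deepseek R1 server.
        return "stubbed judge response"
    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)
    def get_model_name(self) -> str:
        return "my-custom-judge"

# Pass the wrapper to any metric via the model argument.
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    model=MyCustomJudge(),
)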
And finally, run test_chatbot.py in the CLI:
deepeval test run test_chatbot.py
Congratulations! Your test case should have passed ✅ Let's break down what happened.
- The variable input mimics a user input, and actual_output is a placeholder for what your application is supposed to output based on this input.
- The variable expected_output represents the ideal answer for a given input, and GEval is a research-backed metric provided by deepeval for evaluating your LLM output on any custom criteria with human-like accuracy.
- In this example, the metric criteria is the correctness of the actual_output based on the provided expected_output.
- All metric scores range from 0 to 1; the threshold=0.5 value ultimately determines whether your test has passed.
Read our documentation for more information on more options to run end-to-end evaluation, how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.
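If you do write a custom metric, it is typically a small class that scores an LLMTestCase however you like. Here is a minimal sketch, assuming deepeval's BaseMetric interface with measure / a_measure / is_successful hooks (the exact hooks may vary by version), and a toy LengthMetric invented purely for illustration:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    """Toy custom metric: passes when the actual output stays reasonably concise."""
    def __init__(self, max_chars: int = 300, threshold: float = 0.5):
        self.max_chars = max_chars
        self.threshold = threshold
    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 when within the limit, scaling down as the output grows longer.
        length = len(test_case.actual_output or "")
        self.score = min(1.0, self.max_chars / max(length, 1))
        self.reason = f"Output length {length} vs limit {self.max_chars}"
        self.success = self.score >= self.threshold
        return self.score
    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)
    def is_successful(self) -> bool:
        return self.success
    @property
    def __name__(self):
        return "Length"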
Evaluating Nested Components
If you wish to evaluate individual components within your LLM app, you need to run component-level evals - a powerful way to evaluate any component within an LLM system.
Simply trace "components" such as LLM calls, retrievers, tool calls, and agents within your LLM application using the @observe decorator to apply metrics at the component level. Tracing with deepeval is non-intrusive (learn more here) and helps you avoid rewriting your codebase just for evals:
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import Golden
from deepeval.metrics import GEval
from deepeval import evaluate

correctness = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]
)

@observe(metrics=[correctness])
def inner_component():
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    update_current_span(test_case=LLMTestCase(input="...", actual_output="..."))
    return

@observe
def llm_app(input: str):
    inner_component()
    return

evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])
You can learn everything about component-level evaluations here.
Evaluating Without Pytest Integration
Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])
Using Standalone Metrics
DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# All metrics also offer an explanation
print(answer_relevancy_metric.reason)
Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.
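For instance (a rough sketch; exact field requirements can change between versions), the RAG-oriented FaithfulnessMetric reads the retrieval_context returned by your retriever, while HallucinationMetric is judged against the ground-truth context you supply yourself:
from deepeval.metrics import FaithfulnessMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# RAG-style: judged against what your retriever actually returned.
rag_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
FaithfulnessMetric(threshold=0.7).measure(rag_test_case)

# Hallucination-style: judged against ground-truth context you provide.
hallucination_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
HallucinationMetric(threshold=0.7).measure(hallucination_test_case)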
Evaluating a Dataset / Test Cases in Bulk
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather like today?")])

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        # your_llm_app() is a placeholder for a call into your own LLM application
        actual_output=your_llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

@pytest.mark.parametrize(
    "test_case",
    dataset.test_cases,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])
# Run this in the CLI, you can also add an optional -n flag to run tests in parallel
deepeval test run test_<filename>.py -n 4
Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:
from deepeval import evaluate
...
evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])
DeepEval With Confident AI
DeepEval is available on Confident AI, an evals & observability platform that allows you to:
- Curate/annotate evaluation datasets on the cloud
- Benchmark your LLM app on a dataset, and compare against previous iterations to find which models and prompts work best
- Fine-tune metrics for custom results
- Debug evaluation results via LLM traces
- Monitor & evaluate LLM responses in production to improve datasets with real-world data
- Repeat until perfection
Everything about Confident AI, including how to use it, is available here.
To begin, login from the CLI:
deepeval login
Follow the instructions to log in, create your account, and paste your API key into the CLI.
Now, run your test file again:
deepeval test run test_chatbot.py
You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!

Configuration
Environment variables via .env files
Using .env.local or .env is optional. If they are missing, DeepEval uses your existing environment variables. When present, dotenv environment variables are auto-loaded at import time (unless you set DEEPEVAL_DISABLE_DOTENV=1).
Precedence: process env -> .env.local -> .env
cp .env.example .env.local
# then edit .env.local (ignored by git)
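As an illustration, a minimal .env.local might hold nothing more than the judge model's key (any other provider keys you rely on would go in the same file; the value shown is a placeholder):
# .env.local: loaded automatically at import time unless DEEPEVAL_DISABLE_DOTENV=1
OPENAI_API_KEY="..."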
Contributing
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
Roadmap
Features:
- Integration with Confident AI
- Implement G-Eval
- Implement RAG metrics
- Implement Conversational metrics
- Evaluation Dataset Creation
- Red-Teaming
- DAG custom metrics
- Guardrails
Authors
Built by the founders of Confident AI. Contact [email protected] for all enquiries.
License
DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.