LLM Evaluation from Beginner to Expert - Introductory Core Concepts

2026/6/23 5:14:06

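Contents

- Core Concepts - Think Like an Expert
  - 2.1 Test Case: the atomic unit of evaluation
    - Single-turn vs multi-turn test cases
  - 2.2 Metric: your evaluation yardstick
  - 2.3 Dataset: the ammunition depot for your tests
  - 2.4 Ways to evaluate: three run modes
    - Option 1: assert_test (pytest style)
    - Option 2: the evaluate function
    - Option 3: running a metric on its own
  - 2.5 Core architecture diagram

Core Concepts - Think Like an Expert

It's the middle of the night and I've just finished overtime, so here is one more post. I hope everyone can, as the title says, think like an expert and avoid adding to their teammates' burden.

2.1 Test Case: the atomic unit of evaluation

In DeepEval, a test case is the smallest unit of evaluation. Think of it as an exam paper that contains:

- Question (input): the user's question or instruction
- Student's answer (actual_output): the output of your LLM application
- Model answer (expected_output): the ideal response you hope for (optional)
- Reference material (context): background knowledge given to the LLM (optional)
- Retrieved results (retrieval_context): documents retrieved by the RAG system (optional)
- Tool calls (tools_called): the list of tools the agent called (optional)
- Expected tools (expected_tools): the tools the agent is expected to call (optional)

Key point for beginners: think of LLMTestCase as a dictionary or JSON object that holds everything the evaluation needs. DeepEval takes this information and asks a judge LLM, "Hey, how good is this answer?"

```python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",  # user question
    expected_output="You're eligible for a 30 day refund.",  # expected answer
    actual_output="We offer a 30-day full refund at no extra cost.",  # actual answer
    context=["All customers are eligible for a 30 day full refund at no extra cost."],  # context
    retrieval_context=["Only shoes can be refunded."],  # retrieved results
    tools_called=[ToolCall(name="WebSearch")],  # tools that were called
)
```

Single-turn vs multi-turn test cases

```python
from deepeval.test_case import ConversationalTestCase, LLMTestCase, Turn

# Single-turn test: one question, one answer
single_turn = LLMTestCase(
    input="What's the weather today?",
    actual_output="It's sunny and 25°C.",
)

# Multi-turn test: a complete conversation.
# Recent DeepEval releases describe each turn as Turn(role, content);
# older releases took a list of LLMTestCase objects instead.
multi_turn = ConversationalTestCase(
    scenario="Customer asking about return policy",
    turns=[
        Turn(role="user", content="Hi, I want to return my shoes"),
        Turn(role="assistant", content="Sure, I can help with that..."),
        Turn(role="user", content="How long do I have?"),
        Turn(role="assistant", content="You have 30 days..."),
    ],
)
```

LLMTestCase is the basic unit of evaluation and represents a single LLM interaction.

Parameter          Type            Required  Description
-----------------  --------------  --------  ------------------------------------
input              str             ✅        Input sent to the LLM
actual_output      str             ✅        Output generated by the LLM
expected_output    str             ❌        Expected output (reference answer)
retrieval_context  List[str]       ❌        Context retrieved by RAG
context            List[str]       ❌        Additional background information
tools_called       List[ToolCall]  ❌        Tools the agent called
expected_tools     List[ToolCall]  ❌        Tools the agent is expected to call

```python
# The demo from the official site, included for reference
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Who is the current president of the United States?",
    actual_output="Joe Biden is the current president of the United States.",
    expected_output="Joe Biden",
    retrieval_context=["Joe Biden currently serves as president of the United States."],
)
```

2.2 Metric: your evaluation yardstick

Metrics are the standards you evaluate against. DeepEval provides 50+ metrics, grouped into several families:

┌────────────────────────────────────────────────────────────────┐
│                    DeepEval metric families                    │
├────────────────────────┬───────────────────────────────────────┤
│ RAG metrics            │ Faithfulness, Answer Relevancy,       │
│                        │ Contextual Precision/Recall/Relevancy │
├────────────────────────┼───────────────────────────────────────┤
│ Agent metrics          │ Task Completion, Tool Correctness,    │
│                        │ Step Efficiency, Plan Adherence       │
├────────────────────────┼───────────────────────────────────────┤
│ Conversational metrics │ Conversation Relevancy, Knowledge     │
│                        │ Retention, Role Adherence             │
├────────────────────────┼───────────────────────────────────────┤
│ Safety metrics         │ Bias, Toxicity, PII Leakage,          │
│                        │ Misuse, Non-Advice                    │
├────────────────────────┼───────────────────────────────────────┤
│ General metrics        │ Hallucination, Summarization,         │
│                        │ JSON Correctness, Ragas               │
├────────────────────────┼───────────────────────────────────────┤
│ Custom metrics         │ G-Eval, DAGMetric                     │
└────────────────────────┴───────────────────────────────────────┘

Every metric returns:

- score: a number between 0 and 1
- reason: an explanation of the score
- success: whether the threshold was met (score >= threshold)

Key points for beginners:

- What is score? It works like an exam mark: 0 is the worst and 1 is the best.
- What is threshold? The pass mark. With threshold=0.7, a score of at least 0.7 counts as a pass.
- What is reason? The judge LLM's written comment explaining why it gave that score. This is extremely useful when debugging.

Example output:

```
Score: 0.85
Reason: The response directly answers the user's question about return policy and provides accurate information consistent with the context.
Success: True
```

Common pitfalls ⚠️

- Pitfall 1: treating score as a percentage. It is a decimal between 0 and 1, so 0.85 means 85%.
- Pitfall 2: setting threshold too high. Start somewhere between 0.5 and 0.7 and adjust gradually.
- Pitfall 3: ignoring reason. It is the best debugging aid you have, so always read it.

2.3 Dataset: the ammunition depot for your tests

A dataset is a collection of test cases. You can build one in code, load it from a file, or pull it from the cloud:

```python
from deepeval.dataset import EvaluationDataset, Golden

# Create a dataset in code
dataset = EvaluationDataset(
    goldens=[
        Golden(input="What is DeepEval?", expected_output="An LLM evaluation framework."),
        Golden(input="How to install?", expected_output="pip install deepeval"),
    ]
)

# Load from CSV (assumes the file has a column named "input")
dataset.add_goldens_from_csv_file(file_path="test_data.csv", input_col_name="input")

# Load from JSON (assumes each record has a key named "input")
dataset.add_goldens_from_json_file(file_path="test_data.json", input_key_name="input")

# Pull from the Confident AI cloud
dataset.pull(alias="My Production Dataset")
```

2.4 Ways to evaluate: three run modes

Option 1: assert_test (pytest style)

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What is DeepEval?",
        actual_output="DeepEval is an open-source LLM evaluation framework.",
    )
    # Fails the test if any metric scores below its threshold
    assert_test(test_case, [metric])
```

Run it with:

```bash
deepeval test run test_file.py
```

Option 2: the evaluate function

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# FaithfulnessMetric checks the answer against retrieval_context,
# so each test case needs one in addition to input and actual_output.
test_cases = [
    LLMTestCase(input="Q1", actual_output="A1", retrieval_context=["C1"]),
    LLMTestCase(input="Q2", actual_output="A2", retrieval_context=["C2"]),
]
metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]

results = evaluate(test_cases=test_cases, metrics=metrics)
```

Option 3: running a metric on its own

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric()
test_case = LLMTestCase(input="Q1", actual_output="A1")

# measure() calls the judge LLM and stores the result on the metric object
metric.measure(test_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")
```

2.5 Core architecture diagram

User input (input) --> Your LLM application --> Actual output (actual_output)

input + actual_output + reference material (context) + retrieved results (retrieval_context)
    --> Test case (LLMTestCase)
    --> Metric
    --> Score (0-1) and Reason
    --> Passed? (score >= threshold)
          Yes --> ✅ test passed
          No  --> ❌ test failed

After a quick pass through the official site, these feel like the core ideas; the rest is a question of how you apply them. It works much like pytest, so anyone who already knows pytest should find it approachable and easy to pick up.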
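To make the score / threshold / success relationship from section 2.2 concrete, here is a minimal sketch in plain Python. It is not DeepEval's implementation and does not call a judge LLM: `ToyTestCase`, `ToyOverlapMetric`, and `pass_rate` are hypothetical names invented for illustration, and the word-overlap scoring is a deliberately crude stand-in for a real metric. It only mirrors the pattern the article describes: `measure()` fills in `score`, `reason`, and `success`, where `success` is `score >= threshold`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTestCase:
    """Hypothetical stand-in for LLMTestCase with only the fields used here."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None


class ToyOverlapMetric:
    """Hypothetical metric: share of expected words that appear in the output."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score: Optional[float] = None
        self.reason: Optional[str] = None
        self.success: Optional[bool] = None

    def measure(self, test_case: ToyTestCase) -> float:
        expected = set((test_case.expected_output or "").lower().split())
        actual = set(test_case.actual_output.lower().split())
        if not expected:
            # Nothing to compare against, so the score is the worst value.
            self.score = 0.0
            self.reason = "No expected_output to compare against."
        else:
            matched = expected & actual
            self.score = len(matched) / len(expected)
            self.reason = (
                f"{len(matched)} of {len(expected)} expected words "
                "appear in the actual output."
            )
        # The pass/fail rule from section 2.2: success means score >= threshold.
        self.success = self.score >= self.threshold
        return self.score


def pass_rate(cases: List[ToyTestCase], metric: ToyOverlapMetric) -> float:
    """Fraction of test cases for which the metric reports success."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        metric.measure(case)
        if metric.success:
            passed += 1
    return passed / len(cases)


good_case = ToyTestCase(
    input="What if these shoes don't fit?",
    actual_output="we offer a 30 day full refund",
    expected_output="30 day refund",
)
weak_case = ToyTestCase(
    input="How to install?",
    actual_output="run pip",
    expected_output="pip install deepeval",
)

metric = ToyOverlapMetric(threshold=0.5)
metric.measure(good_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")
print(f"Success: {metric.success}")
print(f"Pass rate: {pass_rate([good_case, weak_case], metric)}")
```

Raising or lowering `threshold` changes only `success`, never `score`, which is why tuning the threshold gradually (pitfall 2) is safe: the scores and reasons from earlier runs remain comparable.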
