When load-testing an Agent, don't make up requests; replay production logs as your samples

2026/6/30 3:25:05

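When I first load-tested our Agent, I looped a single "hello" a few thousand times. The QPS looked great, and then real traffic flattened it after launch. The problem was the sample: real users ask questions of varying length, some trigger tools and some don't, some hit the cache and some miss. Hammering the Agent with one fixed greeting tests a fake scenario. Only after I switched to replaying production logs did the load-test results line up with production.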

Why a fixed sample gives misleading numbers

An Agent's latency and cost depend heavily on the question itself:

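Short questions generate quickly, while long ones (RAG with long documents attached) are several times slower;
Requests that trigger a tool call need an extra model inference plus a tool round trip, so latency roughly doubles;
Repeated questions may hit the cache and return in milliseconds; mixed into the sample, they drag the average latency artificially low.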

With a single sample, every request is either fast or slow, and the resulting numbers can't guide capacity planning.
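
To make the point concrete, here is a small back-of-the-envelope sketch. The traffic shares and per-type latencies are made-up numbers for illustration, not measurements from any real system. It compares the mean latency of a fixed-greeting run with the mean of a mixed workload.

```python
def mean_latency(mix):
    """mix: list of (share, latency_seconds) pairs whose shares sum to 1."""
    return sum(share * latency for share, latency in mix)

# Hypothetical per-request latencies in seconds (illustrative only).
CACHE_HIT, SHORT, LONG_RAG, TOOL = 0.005, 0.8, 3.0, 1.6

# Looping one short greeting: every request is the "short" kind.
fixed_greeting = [(1.0, SHORT)]

# A hypothetical production mix: cache hits, short, long RAG, tool calls.
realistic_mix = [(0.2, CACHE_HIT), (0.4, SHORT), (0.25, LONG_RAG), (0.15, TOOL)]

print(f"fixed sample mean:  {mean_latency(fixed_greeting):.3f}s")
print(f"realistic mix mean: {mean_latency(realistic_mix):.3f}s")
```

With these assumed numbers the mixed workload is noticeably slower on average than the greeting loop, and changing the shares moves the result a lot, which is why the sample mix has to come from real traffic.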

Approach: replay real traffic after scrubbing it

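Pull the request logs for a stretch of production traffic, scrub them (phone numbers, names, and the like must be removed), turn them into a load-test sample pool, and replay them in their real proportions.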
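
The scrubbing step might look roughly like the sketch below. It assumes the raw logs are JSON lines with `msg`, `hit_tool` and `tokens` fields. The helper names `scrub` and `build_pool` and the `<PHONE>` placeholder are mine, not from any particular tool. The regex only covers mainland China mobile numbers; names cannot be caught reliably by a regex and need a separate step.

```python
import json
import re

# Mainland China mobile numbers: 11 digits starting with 1 followed by 3-9.
# The lookarounds stop the pattern from matching inside longer digit runs.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

def scrub(text: str) -> str:
    """Mask phone numbers. Names need a separate step (not covered here)."""
    return PHONE_RE.sub("<PHONE>", text)

def build_pool(raw_lines):
    """Turn raw JSON log lines into scrubbed samples, one per request."""
    pool = []
    for line in raw_lines:
        rec = json.loads(line)
        pool.append({
            "msg": scrub(rec["msg"]),
            "hit_tool": rec.get("hit_tool", False),
            "tokens": rec.get("tokens", 0),
        })
    return pool
```

Note that `build_pool` keeps every record and does not deduplicate. Repeated questions stay repeated, so a uniform random pick from the pool reproduces the proportions seen in production.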

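import asyncio
import json
import random
import time

# 1. Load samples exported from the logs (already scrubbed), keeping the
#    question text and the traits recorded for it at the time.
#    Each line looks like {"msg": "How do I return an item?", "hit_tool": false, "tokens": 120}
with open("traffic_sample.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

# call_agent() and report() are not defined in this snippet.
async def replay(concurrency=50, duration=60):
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests
    latencies, errors = [], 0
    end = time.time() + duration

    async def one():
        nonlocal errors
        async with sem:
            s = random.choice(samples)  # uniform pick, so the pool's real mix is preserved
            t0 = time.time()
            try:
                await call_agent(s["msg"])
                latencies.append(time.time() - t0)
            except Exception:
                errors += 1

    tasks = []
    while time.time() < end:
        tasks.append(asyncio.create_task(one()))
        await asyncio.sleep(1 / concurrency)  # launches about `concurrency` requests per second
    await asyncio.gather(*tasks)
    report(latencies, errors)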
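
The replay code calls `report(latencies, errors)` without showing it. One possible shape is sketched below. It is my assumption, not the original implementation. It reports percentiles rather than a mean, because cache hits can make the average look better than it is.

```python
import statistics

def report(latencies, errors):
    """Summarize one replay run. latencies: seconds per successful request."""
    total = len(latencies) + errors
    if total == 0:
        print("no requests were sent")
        return None
    summary = {"total": total, "error_rate": errors / total}
    if len(latencies) >= 2:
        # quantiles(n=100) returns 99 cut points:
        # index 49 is p50, index 94 is p95, index 98 is p99.
        cuts = statistics.quantiles(latencies, n=100, method="inclusive")
        summary.update(p50=cuts[49], p95=cuts[94], p99=cuts[98])
    print(summary)
    return summary
```

Returning the summary as a dict in addition to printing it makes it easy to compare two runs, for example before and after a config change.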