test

production env · 1 suite · 2 completed runs

Subject № 319f5c68-8beb-4b8b-838a-2cb37c4ad6a1 PRODUCTION
Evaluation status · Maintained normally

Eval suites under maintenance, all in sync

2 scenarios · 0 KB items · 1 suite
FLEET READY · ALL KINDS COVERED

20 cases · test (bulk R1)

kb_accuracy 4
scenario_funnel 10
mixed_qa 6
uncategorized 0
01

Vital signs

[KIND × DIMENSION] vital signs — this bot's per-dim clearance vs. its baseline
Knowledge base accuracy [—]
Retrieval 100.0% 100.0%
Faithfulness 100.0%
Answer quality 96.7% 96.7%
Scenario invocation and completion [PASS]
Scenario 100.0% 100.0% ≥95.0% [±5pp] +5.0 ✓
Tool use 100.0% 100.0% ≥95.0% [±5pp] +5.0 ✓
Answer quality 85.7% 84.7% ≥77.7% [±8pp] +7.0 ✓
Conversational competence (mixed Q&A) [FAIL]
Retrieval 100.0% 100.0%
Faithfulness 33.3% 50.0% <70.0% [floor] -20.0 ✗
Answer quality 86.1% 86.1% ≥78.1% [±8pp] +8.0 ✓
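
The pass/fail arithmetic in the rows above can be reproduced with a short sketch. It rests on assumptions the page does not state and that are inferred only from the numbers: the first percentage on a row is taken as the baseline, the second as the current run, a `[±Npp]` row is cleared when the current value is at least baseline minus N, a `[floor]` row uses a fixed minimum instead, and the signed figure is current minus threshold. The names `DimensionResult`, `threshold`, `margin`, and `passes` are hypothetical and do not come from the dashboard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimensionResult:
    """One row of the vital-signs table (field meanings are assumed)."""
    name: str
    baseline: float                       # first percentage on the row (assumed)
    current: float                        # second percentage on the row (assumed)
    tolerance_pp: Optional[float] = None  # e.g. 5.0 for a [±5pp] row
    floor: Optional[float] = None         # e.g. 70.0 for a [floor] row


def threshold(r: DimensionResult) -> float:
    """Absolute floor if one is set, otherwise baseline minus tolerance."""
    if r.floor is not None:
        return r.floor
    if r.tolerance_pp is None:
        raise ValueError(f"{r.name}: no tolerance or floor to check against")
    return r.baseline - r.tolerance_pp


def margin(r: DimensionResult) -> float:
    """Signed distance from the threshold, in percentage points."""
    return round(r.current - threshold(r), 1)


def passes(r: DimensionResult) -> bool:
    """A dimension clears when its margin is not negative."""
    return margin(r) >= 0


rows = [
    DimensionResult("scenario", 100.0, 100.0, tolerance_pp=5.0),
    DimensionResult("answer quality (scenario)", 85.7, 84.7, tolerance_pp=8.0),
    DimensionResult("faithfulness (mixed Q&A)", 33.3, 50.0, floor=70.0),
    DimensionResult("answer quality (mixed Q&A)", 86.1, 86.1, tolerance_pp=8.0),
]

for r in rows:
    mark = "pass" if passes(r) else "fail"
    print(f"{r.name}: margin {margin(r):+.1f} pp, {mark}")
```

Under these assumptions the computed margins match the figures shown on the page (+5.0, +7.0, -20.0, +8.0), and the single failing row is consistent with the [FAIL] label on the mixed Q&A group. Rows without a threshold, such as the knowledge base accuracy group marked [—], are left out because there is nothing to check them against.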
02

Test suites