Release Gate Thresholds

BenQ Arm Chatbot (bulk R1): set the pass/fail threshold for each metric. Blank = no threshold.

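One number per metric. The release gate looks up each metric's direction in EvalMetric::DIRECTIONS and compares the actual value against the threshold with ≥ (higher is better) or ≤ (lower is better).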
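The comparison rule above can be sketched as follows. This is a minimal illustration, not the application's actual code: only the constant name EvalMetric::DIRECTIONS comes from this page, while the hash contents, the direction symbols, and the `gate_pass?` helper are assumptions made for the example.

```ruby
# Hypothetical stand-in for the real EvalMetric::DIRECTIONS.
# Only the constant name appears on the page; the entries and the
# :higher_is_better / :lower_is_better symbols are assumed.
module EvalMetric
  DIRECTIONS = {
    "knowledges_precision"  => :higher_is_better,
    "first_sent_latency_ms" => :lower_is_better,
    "hallucination_rate"    => :lower_is_better
  }.freeze
end

# Hypothetical helper: returns true when the metric passes its gate.
# A nil or blank threshold means no gate is configured, so the metric passes.
def gate_pass?(metric, actual, threshold)
  return true if threshold.nil? || threshold.to_s.strip.empty?

  limit = Float(threshold) # accepts numbers or numeric strings from a form
  case EvalMetric::DIRECTIONS.fetch(metric) # KeyError for an unknown metric
  when :higher_is_better then actual >= limit # threshold is a lower bound
  when :lower_is_better  then actual <= limit # threshold is an upper bound
  else raise ArgumentError, "unknown direction for #{metric}"
  end
end
```

Under these assumptions, `gate_pass?("knowledges_precision", 0.79, 0.8)` returns `false` because the value is below the lower bound, while `gate_pass?("hallucination_rate", 0.5, "")` returns `true` because a blank field sets no threshold.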

01

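Retrieval: Knowledge

knowledges_precision

↑ Higher is better: set a lower bound

Fails if the actual value < this value

knowledges_recall

↑ Higher is better: set a lower bound

Fails if the actual value < this value

knowledges_f1

↑ Higher is better: set a lower bound

Fails if the actual value < this value

knowledges_relevance

↑ Higher is better: set a lower bound

Fails if the actual value < this value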

02

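Retrieval: Products

products_precision

↑ Higher is better: set a lower bound

Fails if the actual value < this value

products_recall

↑ Higher is better: set a lower bound

Fails if the actual value < this value

products_f1

↑ Higher is better: set a lower bound

Fails if the actual value < this value

products_relevance

↑ Higher is better: set a lower bound

Fails if the actual value < this value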

03

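Tool Use

tools_precision

↑ Higher is better: set a lower bound

Fails if the actual value < this value

tools_recall

↑ Higher is better: set a lower bound

Fails if the actual value < this value

tools_f1

↑ Higher is better: set a lower bound

Fails if the actual value < this value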

04

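Performance

scenario_precision

↑ Higher is better: set a lower bound

Fails if the actual value < this value

scenario_recall

↑ Higher is better: set a lower bound

Fails if the actual value < this value

scenario_f1

↑ Higher is better: set a lower bound

Fails if the actual value < this value

rule_compliance

↑ Higher is better: set a lower bound

Fails if the actual value < this value

first_sent_latency_ms

↓ Lower is better: set an upper bound

Fails if the actual value > this value

total_execution_time_ms

↓ Lower is better: set an upper bound

Fails if the actual value > this value

total_tokens

↓ Lower is better: set an upper bound

Fails if the actual value > this value

total_cost_usd

↓ Lower is better: set an upper bound

Fails if the actual value > this value

refused

↓ Lower is better: set an upper bound

Fails if the actual value > this value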

05

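Faithfulness

hallucination_rate

↓ Lower is better: set an upper bound

Fails if the actual value > this value

citation_grounding

↑ Higher is better: set a lower bound

Fails if the actual value < this value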

06

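Answer Quality

answer_relevance

↑ Higher is better: set a lower bound

Fails if the actual value < this value

answer_completeness

↑ Higher is better: set a lower bound

Fails if the actual value < this value

answer_correctness

↑ Higher is better: set a lower bound

Fails if the actual value < this value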

Cancel