Agent Harness Eval Plan

Use this worksheet before releasing an AI agent that can call tools, modify state, or hand work to other systems.

1. Agent Scope

Question | Answer
What job is the agent responsible for? |
What jobs are explicitly out of scope? |
Which tools can it call? |
Which tools can change state? |
When must it ask a human? |

2. Tool Boundaries

Tool | Allowed Use | Not Allowed | Required Checks

3. Eval Dataset

Start with 30 to 50 representative examples. Include normal work, ambiguous work, and hostile or risky inputs.

Case Type | Count | Example Source
Happy path | |
Ambiguous request | |
Missing context | |
Tool failure | |
Unsafe or out-of-scope request | |
Recovery after mistake | |
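
A quick composition check can catch a lopsided dataset before any eval runs. The sketch below is one possible way to do it, not a prescribed harness: the "case_type" key and the check_dataset helper are hypothetical names, and the floor of 30 comes from the guidance above.

```python
from collections import Counter

# Case types copied from the table above.
CASE_TYPES = [
    "Happy path",
    "Ambiguous request",
    "Missing context",
    "Tool failure",
    "Unsafe or out-of-scope request",
    "Recovery after mistake",
]


def check_dataset(cases, min_total=30):
    """Return a list of composition problems (empty if none).

    Assumption: each case is a dict with a "case_type" key. That key name
    is illustrative and not part of any particular eval framework.
    """
    problems = []
    counts = Counter(case["case_type"] for case in cases)

    # The worksheet suggests starting with at least 30 examples.
    if len(cases) < min_total:
        problems.append(f"only {len(cases)} cases, expected at least {min_total}")

    # Every case type in the table should have at least one example.
    for case_type in CASE_TYPES:
        if counts[case_type] == 0:
            problems.append(f"no cases for: {case_type}")

    # Flag labels that do not match the table, which usually means a typo.
    for unknown in sorted(set(counts) - set(CASE_TYPES)):
        problems.append(f"unknown case type: {unknown}")

    return problems
```

An empty list means every case type is represented and the minimum size is met. It says nothing about whether the examples themselves are representative.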

4. Pass Criteria

Criterion | Passing Behavior
Task completion |
Tool selection |
State-changing action safety |
Escalation judgment |
Explanation quality |
Cost and latency |

5. Failure Review

Do not track only the total score. Sort failures by type, because each type points to a different fix.

Failure Category | Examples | Likely Fix
Wrong tool | | Tool descriptions or routing logic
Missing escalation | | Policy, prompt, or harness guardrail
Bad recovery | | State tracking or retry policy
Hallucinated action | | Tool confirmation and output validation
Too expensive or slow | | Model routing, caching, or narrower context
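
Sorting failures by type can be automated once each failed case carries a category label. The following is a minimal sketch under assumptions: the result keys ("case_id", "passed", "failure_category") and both function names are hypothetical, and the fix text is copied from the table above.

```python
from collections import defaultdict

# Likely fixes copied from the table above.
LIKELY_FIX = {
    "Wrong tool": "Tool descriptions or routing logic",
    "Missing escalation": "Policy, prompt, or harness guardrail",
    "Bad recovery": "State tracking or retry policy",
    "Hallucinated action": "Tool confirmation and output validation",
    "Too expensive or slow": "Model routing, caching, or narrower context",
}


def group_failures(results):
    """Map each failure category to the IDs of the cases that failed that way.

    Assumption: each result is a dict with "case_id", "passed", and an
    optional "failure_category" key. These names are illustrative only.
    """
    grouped = defaultdict(list)
    for result in results:
        if not result["passed"]:
            category = result.get("failure_category", "Uncategorized")
            grouped[category].append(result["case_id"])
    return dict(grouped)


def failure_report(results):
    """Return (category, count, likely fix) rows, largest category first."""
    grouped = group_failures(results)
    # Sort by descending count, then by name so ties have a stable order.
    rows = sorted(grouped.items(), key=lambda item: (-len(item[1]), item[0]))
    return [
        (category, len(case_ids), LIKELY_FIX.get(category, "Needs triage"))
        for category, case_ids in rows
    ]
```

Failures without a category fall into "Uncategorized" instead of being dropped, so unlabeled cases stay visible in the review.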

6. Release Gate

Gate | Threshold | Result
Critical failures | 0 allowed |
Escalation accuracy | 95% or better |
Tool-call success | 98% or better |
Cost per task | Within budget |
p95 latency | Within target |
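
The gate can be expressed as a single pass/fail check so a release script cannot skip a row. This is a sketch under stated assumptions: EvalSummary and release_gate are hypothetical names, rates are stored as fractions between 0 and 1, and the cost budget and latency target are passed in by the caller because the worksheet leaves them team-specific.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Aggregated eval results. Field names are illustrative assumptions."""
    critical_failures: int
    escalation_accuracy: float  # fraction, e.g. 0.96 for 96%
    tool_call_success: float    # fraction, e.g. 0.99 for 99%
    cost_per_task: float        # same currency unit as the budget
    p95_latency_s: float        # seconds


def release_gate(summary, cost_budget, latency_target_s):
    """Return (release_ok, per_gate_results) using the thresholds above."""
    gates = {
        "Critical failures": summary.critical_failures == 0,
        "Escalation accuracy": summary.escalation_accuracy >= 0.95,
        "Tool-call success": summary.tool_call_success >= 0.98,
        "Cost per task": summary.cost_per_task <= cost_budget,
        "p95 latency": summary.p95_latency_s <= latency_target_s,
    }
    # Every gate must pass; a single failure blocks the release.
    return all(gates.values()), gates
```

The per-gate dictionary can be copied into the Result column, so the worksheet records which gate blocked a release and not only that one did.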

7. Maintenance

Trigger | Action
Model version changes | Rerun the regression set
Tool API changes | Add new tool-failure cases
User complaints increase | Sample failures and update evals
New workflow added | Create a new eval slice
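
If maintenance triggers are detected by automation, the table can live in code as a lookup. A minimal sketch, assuming hypothetical trigger identifiers; the action text is copied from the table above.

```python
# Trigger identifiers are hypothetical; actions are copied from the table above.
MAINTENANCE_ACTIONS = {
    "model_version_changed": "Rerun the regression set",
    "tool_api_changed": "Add new tool-failure cases",
    "user_complaints_increased": "Sample failures and update evals",
    "new_workflow_added": "Create a new eval slice",
}


def actions_for(triggers):
    """Return the maintenance actions for the given triggers, in order."""
    unknown = [t for t in triggers if t not in MAINTENANCE_ACTIONS]
    if unknown:
        # Fail loudly so a new kind of trigger gets an explicit action.
        raise ValueError(f"no maintenance action defined for: {unknown}")
    return [MAINTENANCE_ACTIONS[t] for t in triggers]
```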