Lab 3.5

Agent Evaluation & Benchmarking

75 min · 11 sections · 3 prerequisites

What You'll Learn

  • Build a golden test set for agent evaluation
  • Define measurable success criteria and scoring rubrics
  • Run systematic evaluations and measure consistency
  • Document failure modes and propose improvements

Prerequisites

  • 2.1-process-analysis
  • 2.2-task-decomposition
  • lab-3-workflow-mapping
Part 1 of 11

Lab Overview

You've designed AI workflows. But how do you know if they actually work?

"It seems good" isn't good enough. This lab teaches you to systematically evaluate AI agents with the same rigor used by leading AI labs—but adapted for business applications.

What you'll create:

  • A golden test set with input-output pairs
  • Evaluation criteria and scoring rubrics
  • Consistency measurements across multiple runs
  • Failure mode documentation with improvement recommendations

Part 2 of 11

Why Evaluation Matters

"If you can't measure it, you can't improve it."

AI evaluation differs from traditional software testing:

Traditional Software                          AI Applications
────────────────────────────────────────      ──────────────────────────────────────────────────
Given input X, always produces output Y       Same input may produce different outputs
Bugs are deterministic                        Failures may be subtle (wrong tone, missed nuance)
Edge cases are finite                         Edge cases are infinite
Pass/fail is binary                           Quality is often subjective

Without systematic evaluation, you're flying blind.


Part 3 of 11

The Agent Evaluation Matrix

Before testing, understand what you're measuring:

Dimension        What It Tests           Key Questions
─────────        ─────────────           ─────────────
Task Success     Goal completion         Did it do what was asked?
Output Quality   Result correctness      Is the output accurate and useful?
Tool Use         Capability selection    Did it use the right tools correctly?
Reasoning        Decision quality        Was the logic sound?
Safety           Boundary respect        Did it stay within scope?
Reliability      Consistency             Does it work the same way each time?

Part 4 of 11

Part 1: Define Success Criteria (20 minutes)

Step 1: Choose Your Evaluation Target

Select an AI workflow to evaluate. This could be:

  • An existing AI assistant or chatbot
  • A workflow you designed in Lab 3
  • A prompt chain you've built
  • A customer support automation

Document your target:

EVALUATION TARGET
─────────────────
System name: ________________________________
Purpose: ____________________________________
Key tasks it performs:
1. _________________________________________
2. _________________________________________
3. _________________________________________

Step 2: Define Success Criteria

For each key task, define what "success" looks like:

Example for a customer support agent:

Task                       Success Criteria                               Measurement
────                       ────────────────                               ───────────
Answer product questions   Factually correct, references documentation    Manual verification against source
Handle refund requests     Follows policy, captures required info         Checklist completion rate
Escalate complex issues    Recognizes triggers, routes correctly          Escalation accuracy rate

Your turn—define 5 success criteria:

SUCCESS CRITERIA
────────────────
Task 1: ____________________
  Success means: ____________________
  Measured by: ____________________

Task 2: ____________________
  Success means: ____________________
  Measured by: ____________________

Task 3: ____________________
  Success means: ____________________
  Measured by: ____________________

Task 4: ____________________
  Success means: ____________________
  Measured by: ____________________

Task 5: ____________________
  Success means: ____________________
  Measured by: ____________________
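
If you expect to score runs programmatically later in this lab, it can help to capture the criteria you just defined in a structured form from the start. Below is a minimal sketch in Python; the field names and example values are illustrative, not part of the lab template.

# Success criteria kept as plain data so later scoring scripts can reference them.
# The example entries are placeholders; substitute your own tasks and measurements.
success_criteria = [
    {
        "task": "Answer product questions",
        "success_means": "Factually correct and references documentation",
        "measured_by": "Manual verification against source docs",
    },
    {
        "task": "Handle refund requests",
        "success_means": "Follows policy and captures required info",
        "measured_by": "Checklist completion rate",
    },
]

for criterion in success_criteria:
    print(f"{criterion['task']}: measured by {criterion['measured_by']}")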

Step 3: Create a Scoring Rubric

Convert subjective criteria to scores:

Example rubric for "response quality":

Score   Description   Indicators
─────   ───────────   ──────────
5       Excellent     Complete, accurate, well-structured, appropriate tone
4       Good          Mostly complete, minor omissions, good tone
3       Acceptable    Addresses main point, some gaps, adequate tone
2       Poor          Incomplete or partially incorrect, tone issues
1       Failing       Wrong answer, inappropriate, or harmful

Create rubrics for your top 3 criteria:
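
One way to keep scoring consistent across evaluators is to encode the rubric once and reuse it. Below is a minimal sketch; the descriptions mirror the example rubric above, and the function name is just an assumption for illustration.

# 1-5 rubric for "response quality", mirroring the example table above.
RESPONSE_QUALITY_RUBRIC = {
    5: "Excellent: complete, accurate, well-structured, appropriate tone",
    4: "Good: mostly complete, minor omissions, good tone",
    3: "Acceptable: addresses main point, some gaps, adequate tone",
    2: "Poor: incomplete or partially incorrect, tone issues",
    1: "Failing: wrong answer, inappropriate, or harmful",
}

def describe_score(score: int) -> str:
    """Return the rubric description for a score, validating the 1-5 range."""
    if score not in RESPONSE_QUALITY_RUBRIC:
        raise ValueError(f"Score must be 1-5, got {score}")
    return RESPONSE_QUALITY_RUBRIC[score]

print(describe_score(4))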


Part 5 of 11

Part 2: Build Your Test Set (25 minutes)

Golden Examples (10 test cases)

Golden examples are input-output pairs where you know what "good" looks like.

Structure:

TEST CASE #___
──────────────
Input: [What the user says/submits]
Expected output: [What a good response includes]
Success criteria: [Which criteria this tests]
Evaluation notes: [What to look for when scoring]
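
If you want the test set to be reusable across evaluation cycles, the same structure can be kept machine-readable. Below is a minimal sketch using a Python dataclass; the field names mirror the template above, and the example case is hypothetical.

from dataclasses import dataclass

@dataclass
class GoldenTestCase:
    """One golden example: a known input paired with what a good response includes."""
    case_id: int
    category: str             # e.g. "happy_path", "edge_case", "adversarial", "out_of_scope"
    input_text: str           # what the user says or submits
    expected: str             # what a good response includes
    criteria_tested: list[str]
    evaluation_notes: str = ""

# Hypothetical example for a customer support agent.
case_1 = GoldenTestCase(
    case_id=1,
    category="happy_path",
    input_text="How do I reset my password?",
    expected="Step-by-step reset instructions that match the help-center article",
    criteria_tested=["Answer product questions"],
    evaluation_notes="Check the steps against the current documentation",
)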

Create 10 golden examples:

Test Cases 1-4: Happy Path
Standard requests that should work perfectly.

TEST CASE #1 (Happy Path)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________
TEST CASE #2 (Happy Path)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________
TEST CASE #3 (Happy Path)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________
TEST CASE #4 (Happy Path)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________

Test Cases 5-7: Edge Cases
Ambiguous, unusual, or boundary-testing inputs.

TEST CASE #5 (Edge Case - Ambiguous)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________
TEST CASE #6 (Edge Case - Unusual)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________
TEST CASE #7 (Edge Case - Boundary)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________

Test Cases 8-9: Adversarial
Attempts to confuse, manipulate, or break the system.

TEST CASE #8 (Adversarial - Confusion)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________
TEST CASE #9 (Adversarial - Manipulation)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________

Test Case 10: Out of Scope
A request the agent should refuse or escalate.

TEST CASE #10 (Out of Scope)
Input: ________________________________________________
Expected: _____________________________________________
Tests: ________________________________________________

Part 6 of 11

Part 3: Run Evaluation (20 minutes)

Execute Your Test Suite

Run each test case through your AI system. For each response:

  1. Record the actual output
  2. Score against your rubric
  3. Note any unexpected behaviors

Evaluation Log Template:

TEST CASE #___
──────────────
Input: [copy from test set]

Actual Output:
______________________________________________
______________________________________________
______________________________________________

Scores:
  Criterion 1 (________): ___ / 5
  Criterion 2 (________): ___ / 5
  Criterion 3 (________): ___ / 5

Overall: ___ / 5

Notes:
______________________________________________
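
If your system can be called from a script, this logging can be automated. Below is a minimal sketch of an evaluation loop; run_agent() and score_response() are placeholders you would implement for your own system (for example, an API call and a manual or LLM-assisted rubric check), and my_test_cases refers to whatever structure holds your golden examples.

def run_agent(user_input: str) -> str:
    """Placeholder: call your AI system (API, prompt chain, etc.) and return its output."""
    raise NotImplementedError

def score_response(test_case: dict, actual_output: str) -> dict:
    """Placeholder: apply your rubric to the output and return per-criterion scores."""
    raise NotImplementedError

def evaluate(test_cases: list[dict]) -> list[dict]:
    results = []
    for case in test_cases:
        actual = run_agent(case["input_text"])     # 1. record the actual output
        scores = score_response(case, actual)      # 2. score against your rubric
        results.append({
            "case_id": case["case_id"],
            "actual_output": actual,
            "scores": scores,
            "notes": "",                           # 3. note any unexpected behaviors
        })
    return results

# To keep the log for the report appendix:
# import json
# with open("evaluation_log.json", "w") as f:
#     json.dump(evaluate(my_test_cases), f, indent=2)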

Measure Consistency

Pick three of your test cases and run each one three times. Record the variation:

CONSISTENCY CHECK
─────────────────
Test Case #___:
  Run 1 score: ___
  Run 2 score: ___
  Run 3 score: ___
  Variance: ___
  Notes on differences: _______________________

Test Case #___:
  Run 1 score: ___
  Run 2 score: ___
  Run 3 score: ___
  Variance: ___
  Notes on differences: _______________________

Test Case #___:
  Run 1 score: ___
  Run 2 score: ___
  Run 3 score: ___
  Variance: ___
  Notes on differences: _______________________
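
The variance field can be computed directly from the three scores rather than estimated by eye. Below is a minimal sketch using Python's standard library; the scores shown are made-up examples.

from statistics import mean, pvariance

# Overall scores from three runs of the same test case (example values only).
runs = {"run_1": 4, "run_2": 5, "run_3": 3}

scores = list(runs.values())
print(f"Mean score: {mean(scores):.2f}")
print(f"Variance:   {pvariance(scores):.2f}")   # 0.0 means perfectly consistent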

Part 7 of 11

Part 4: Document Failure Modes (10 minutes)

Failure Categorization

Review your results and categorize any failures:

Failure Type      Description                           Your Examples
────────────      ───────────                           ─────────────
Wrong answer      Factually incorrect output            ________________
Incomplete        Missing key information               ________________
Wrong tool        Used inappropriate capability         ________________
Scope violation   Acted outside boundaries              ________________
Tone mismatch     Inappropriate communication style     ________________
Hallucination     Made up information                   ________________
Inconsistency     Different answers to same input       ________________

Pattern Analysis

Look for patterns in your failures:

FAILURE PATTERNS
────────────────
Most common failure type: _______________________
Triggers for failures: __________________________
Percentage of tests with issues: _____%

Root cause hypotheses:
1. ____________________________________________
2. ____________________________________________
3. ____________________________________________
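
If you tag each failed test with one of the failure types above, the pattern summary can be tallied automatically. Below is a minimal sketch; the tags shown are made-up examples standing in for your own evaluation log.

from collections import Counter

# One entry per test case: None for a pass, otherwise the failure type observed.
outcomes = [None, None, "incomplete", None, "hallucination",
            "incomplete", None, "scope_violation", None, "incomplete"]

failures = [tag for tag in outcomes if tag is not None]
counts = Counter(failures)

print("Most common failure type:", counts.most_common(1)[0][0] if counts else "none")
print(f"Percentage of tests with issues: {100 * len(failures) / len(outcomes):.0f}%")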

Improvement Recommendations

Based on your analysis, what changes would improve performance?

RECOMMENDED IMPROVEMENTS
────────────────────────
Priority 1 (Critical):
  Problem: _____________________________________
  Solution: ____________________________________
  Expected impact: _____________________________

Priority 2 (Important):
  Problem: _____________________________________
  Solution: ____________________________________
  Expected impact: _____________________________

Priority 3 (Nice to have):
  Problem: _____________________________________
  Solution: ____________________________________
  Expected impact: _____________________________

Part 8 of 11

Lab Deliverable

Compile your work into an Agent Evaluation Report:

AGENT EVALUATION REPORT
═══════════════════════

System Evaluated: _______________________________
Date: __________________________________________
Evaluator: _____________________________________

EXECUTIVE SUMMARY
─────────────────
Overall score: ___ / 5
Tests passed: ___ / 10
Consistency score: ___ (low variance = good)
Production ready: Yes / No / With changes

KEY METRICS
───────────
Task success rate: ____%
Average quality score: ___ / 5
Failure rate: ____%
Top failure type: _______________

DETAILED FINDINGS
─────────────────
Strengths:
• ___________________________________________
• ___________________________________________

Weaknesses:
• ___________________________________________
• ___________________________________________

RECOMMENDATIONS
───────────────
1. ___________________________________________
2. ___________________________________________
3. ___________________________________________

APPENDIX
────────
[Attach your test cases and evaluation logs]
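
The key metrics in the report can be computed from your evaluation log rather than tallied by hand. Below is a minimal sketch, assuming each result records an overall 1-5 score and whether the test passed; the field names and example values are illustrative only.

from statistics import mean

# Example results; replace with the log produced in Part 3.
results = [
    {"case_id": 1, "overall_score": 5, "passed": True},
    {"case_id": 2, "overall_score": 4, "passed": True},
    {"case_id": 3, "overall_score": 2, "passed": False},
]

task_success_rate = 100 * sum(r["passed"] for r in results) / len(results)
average_quality = mean(r["overall_score"] for r in results)
failure_rate = 100 - task_success_rate

print(f"Task success rate:     {task_success_rate:.0f}%")
print(f"Average quality score: {average_quality:.1f} / 5")
print(f"Failure rate:          {failure_rate:.0f}%")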

Part 9 of 11

Industry Benchmarks Reference

For context, here's how leading benchmarks evaluate agents:

Benchmark    What It Tests                     Current Best Performance
─────────    ─────────────                     ────────────────────────
AgentBench   8 environments (OS, web, DB)      ~45% average
WebArena     812 web tasks                     ~60% (humans: 78%)
GAIA         466 reasoning + tool tasks        Varies by difficulty
ToolEmu      Safety in high-stakes scenarios   Focus on failure prevention

Your business evaluation doesn't need to match these formats—but understanding them helps you think systematically.


Part 10 of 11

Reflection Questions

  1. What surprised you most about your evaluation results?
  2. How would you explain these results to a non-technical stakeholder?
  3. What's the minimum acceptable score for production deployment?
  4. How often should you re-run this evaluation?
  5. What test cases would you add for the next evaluation cycle?

Part 11 of 11

Next Steps

After completing this lab:

  • Lab 4: Design quality gates based on your evaluation criteria
  • Module 3.4: Learn deployment strategies that incorporate evaluation
  • Lab 5: Build an assistant with evaluation built-in
