Testing and Deployment

18 min apply 4 sections
Step 1 of 4

WHY WHY This Matters

AI applications fail in surprising ways. Unlike traditional software with predictable bugs, AI can produce plausible-but-wrong outputs that look fine until they cause problems. Rigorous testing and careful deployment protect:

  • Your users from AI mistakes
  • Your reputation from embarrassing failures
  • Your organization from liability
  • Your resources from costly fixes

Ship fast, but ship responsibly.


Step 2 of 4

WHAT WHAT You Need to Know

AI-Specific Testing Challenges

Testing Strategies

1. Golden Dataset Testing Create a set of inputs with known-good outputs:

Input Expected Output Evaluation Criteria
[Test case 1] [Expected response] Matches key points, appropriate tone
[Test case 2] [Expected response] Includes required elements
[Edge case 1] [Expected handling] Graceful failure, escalation

2. Persona Testing Test with different user types:

  • Friendly user with clear request
  • Confused user with vague request
  • Hostile user testing boundaries
  • Technical user with complex needs
  • Non-native English speaker

3. Failure Mode Testing Deliberately try to break it:

  • Extremely long inputs
  • Empty or minimal inputs
  • Conflicting instructions
  • Off-topic requests
  • Prompt injection attempts

4. Consistency Testing Run same input multiple times:

  • Are outputs meaningfully similar?
  • Does quality vary unacceptably?
  • Are there occasional failures?

Agent Evaluation Frameworks

When testing AI agents (not just prompts), you need specialized evaluation approaches. Agents have additional dimensions of complexity that simple input/output testing doesn't capture.

The Agent Evaluation Matrix

Dimension What It Tests Key Metrics Example Failure
Task Success Goal completion Success rate, completion time Agent gives up prematurely
Output Quality Result correctness Accuracy, relevance scores Correct answer, wrong format
Tool Use Capability selection Tool accuracy, API efficiency Used search when calculator needed
Reasoning Decision quality Logic validity, step coherence Correct conclusion from wrong logic
Safety Boundary respect Policy compliance, escalation accuracy Acted beyond authorized scope
Reliability Consistency Variance across runs, failure rate Works 80% of the time

Industry Benchmarks Reference

Leading AI labs evaluate agents against standardized benchmarks. Understanding these helps you design your own evaluations:

Building Your Own Evaluation

Don't copy industry benchmarks—design evaluations for YOUR use case:

Step 1: Define success dimensions What does "good" mean for your agent? Rank these by importance:

  • Accuracy of final output
  • Efficiency (time, tokens, API calls)
  • User experience (response quality, helpfulness)
  • Safety (staying in bounds, escalating correctly)

Step 2: Create golden examples For each critical task, define:

GOLDEN EXAMPLE
├── Input: [What the agent receives]
├── Expected tool calls: [Which tools, in what order]
├── Expected reasoning: [Key decision points]
├── Expected output: [What success looks like]
└── Failure modes: [What would be wrong]

Step 3: Test systematically

Happy path      → Does it work when everything is normal?
Edge cases      → How does it handle unusual inputs?
Error recovery  → What happens when tools fail?
Adversarial     → Can users manipulate or confuse it?
Consistency     → Same input 5 times = same quality?

Step 4: Measure and iterate Track metrics over time. Improvements to prompts, tools, or orchestration should show up in your evaluation scores.

Evaluation Anti-Patterns

Anti-Pattern Problem Better Approach
Vibes-based eval "It seems good" Define specific metrics
Demo-only testing Cherry-picked examples Random sampling from real usage
Single-run scoring Ignores variance Multiple runs per test case
Output-only focus Misses process issues Evaluate reasoning and tool use
Static test sets Miss evolving issues Add failing cases to test set
Human-only review Doesn't scale Combine automated + human eval

LLM-as-Judge Pattern

Use AI to evaluate AI—with careful design:

EVALUATOR PROMPT STRUCTURE
├── Clear criteria: Exactly what to evaluate
├── Rubric: Specific scoring guidelines (1-5 scale)
├── Examples: What each score looks like
├── Output format: Structured (JSON) for parsing
└── Reasoning: Ask for explanation before score

Example rubric prompt:

Rate this customer support response on a 1-5 scale:

5: Fully addresses query, accurate, helpful, appropriate tone 4: Addresses query with minor gaps, mostly accurate 3: Partially addresses query, some inaccuracies 2: Misses key points, notable errors 1: Incorrect, unhelpful, or inappropriate

First explain your reasoning, then provide score as JSON: {"score": N}

Caution: LLM evaluators have biases (prefer verbose answers, struggle with domain expertise). Calibrate against human judgment.

Pre-Launch Checklist

Deployment Approaches

Approach 1: Shadow Mode

User request → Current system responds
            ↘ AI system also runs (not shown to user)
            → Compare outputs, evaluate AI readiness

Approach 2: Limited Rollout

Week 1: 5% of requests → AI
Week 2: 25% of requests → AI (if Week 1 OK)
Week 3: 100% of requests → AI (if Week 2 OK)

Approach 3: Human-in-the-Loop

AI generates → Human reviews → Approved outputs go live
            → Rejected outputs inform improvements

Approach 4: Internal First

Internal users test → Feedback incorporated → External users

Monitoring in Production

What to Monitor Why Red Flags
Response quality Catch degradation Complaints, low ratings
Latency User experience >5s response times
Error rate System health >1% API errors
Token usage Cost control Unexpected spikes
Escalation rate AI capability Rising escalations
User satisfaction Overall success Declining feedback

Iterative Improvement

After launch, establish feedback loops:

Production usage
      ↓
Collect feedback (automated + manual)
      ↓
Identify improvement opportunities
      ↓
Update prompts / configuration
      ↓
Test changes
      ↓
Deploy updates
      ↓
[Repeat]

Key Concepts

Key Concept

ai testing

AI testing differs from traditional software testing:

Traditional software:

  • Given input X, always produces output Y
  • Bugs are deterministic
  • Edge cases are finite

AI applications:

  • Same input may produce different outputs
  • Failures may be subtle (wrong tone, missed nuance)
  • Edge cases are infinite
  • Quality is often subjective
Key Concept

agent evaluation

Agent evaluation requires testing not just what the agent produces, but HOW it works:

  • Did it choose the right tools?
  • Was its reasoning sound?
  • Did it stay within boundaries?
  • Can it recover from errors?
  • Is it consistent across runs?
Key Concept

eval benchmarks

Industry Benchmarks for AI Agents:

Benchmark Focus What It Tests Current Best Performance
AgentBench General capability 8 environments (OS, web, database) ~45% average (humans: 80%+)
WebArena Web navigation 812 realistic web tasks ~35-60% (humans: 78%)
GAIA Real-world reasoning 466 multi-step problems Varies by difficulty level
ToolBench Tool calling 16,000+ tools, 49 categories ~70% for best models
BFCL Function calling Correct API invocation ~90% for top models

Key insight: Even the best models struggle with multi-step, real-world tasks. Set expectations accordingly.

Key Concept

pre launch

Before deploying an AI application:

Functional checks:

  • Core use cases work correctly
  • Edge cases handled gracefully
  • Error messages are helpful
  • Performance is acceptable

Quality checks:

  • Output quality meets standards
  • Tone and voice are appropriate
  • No hallucinations in test set
  • Sensitive topics handled correctly

Safety checks:

  • Prompt injection tested
  • Harmful request handling verified
  • PII handling appropriate
  • Escalation paths work

Operational checks:

  • API keys secured
  • Rate limits understood
  • Costs projected
  • Monitoring in place
Step 3 of 4

HOW HOW to Apply This

Exercise: Create a Test Plan

Testing Template

TEST PLAN: [Application Name]

1. GOLDEN DATASET
| ID | Input | Expected Output | Pass Criteria |
|----|-------|-----------------|---------------|
| G1 | ... | ... | ... |
| G2 | ... | ... | ... |

2. PERSONA TESTS
| Persona | Scenario | Expected Behavior |
|---------|----------|-------------------|
| P1 | ... | ... |
| P2 | ... | ... |

3. FAILURE MODE TESTS
| Mode | Input | Expected Handling |
|------|-------|-------------------|
| F1 | ... | ... |
| F2 | ... | ... |

4. SUCCESS METRICS
| Metric | Target | Measurement |
|--------|--------|-------------|
| ... | ... | ... |

5. DEPLOYMENT STAGES
| Stage | Users | Duration | Success Gate |
|-------|-------|----------|--------------|
| ... | ... | ... | ... |

Common Launch Failures

Failure Cause Prevention
Quality disaster Insufficient testing Comprehensive test plan
Cost overrun Token usage underestimated Load testing, cost projections
User confusion Poor UX Beta testing with real users
Security incident Prompt injection Security testing
Performance issues Scale not considered Load testing
Rollback chaos No plan Document rollback procedure

Self-Check


Practice Exercises

You've built an AI assistant that helps employees draft expense report justifications. Given expense details, it generates professional justifications suitable for finance approval.

Create a comprehensive test plan:

1. Golden dataset (5 cases): Write 5 test inputs and expected outputs:

  • Routine expense (lunch with client)
  • Large purchase (conference tickets)
  • Unusual expense (thank you gift for vendor)
  • Ambiguous expense (software subscription)
  • Edge case (personal expense mistakenly submitted)

2. Persona tests: How should the system respond to:

  • New employee unfamiliar with policies?
  • Executive with vague expense descriptions?
  • Someone trying to justify personal expenses?

3. Failure mode tests: What happens with:

  • Expense over policy limits?
  • Missing required information?
  • Potentially fraudulent patterns?

4. Success criteria: How will you measure if the system is working?

  • Quality metrics
  • Operational metrics
  • User satisfaction metrics

5. Deployment plan: How would you roll this out?

  • Which users first?
  • What checkpoints?
  • Rollback criteria?
Step 4 of 4

GENERIC Phase 3 Complete!

You've built your Implementation skills. Before moving to Phase 4, complete:

Lab 5: Build an AI Assistant — Create a functional AI assistant using no-code tools

Phase 3 Deliverable: Working AI Application — Build and deploy a functional AI-powered tool that solves a real business problem

Module Complete!

You've reached the end of this module. Review the checklist below to ensure you've understood the key concepts.

Progress Checklist

0/7
0% Complete
0/4 Sections
0/4 Concepts
0/1 Exercises