Creating effective AI agents requires thorough testing to ensure they provide accurate, helpful, and appropriate responses. Prisme.ai provides comprehensive testing capabilities to validate your agents before deployment and continuously improve them over time.
Testing Approaches
Prisme.ai supports multiple testing methodologies to ensure your agents meet your organization’s standards:
- Manual Testing
- Automated Evaluation
- Human-in-the-Loop
- Custom Evaluation
Evaluation Framework
Prisme.ai uses a straightforward evaluation system that makes it easy to assess agent performance. Each response is scored on three criteria (a minimal data-model sketch follows this list):
- Response Quality: 0 (Poor), 1 (Adequate), 2 (Excellent)
- Context Quality: 0 (Poor), 1 (Adequate), 2 (Excellent)
- Hallucination Check: 0 (Significant), 1 (Minor), 2 (None)
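To make the rubric concrete, here is a minimal sketch of how scored results could be recorded, assuming one record per evaluated response. The `EvalResult` name and fields are illustrative, not a Prisme.ai API:

```python
from dataclasses import dataclass

# Illustrative record type (not a Prisme.ai API); each criterion uses
# the 0/1/2 scale described above.
@dataclass
class EvalResult:
    question: str
    response_quality: int  # 0 = Poor, 1 = Adequate, 2 = Excellent
    context_quality: int   # 0 = Poor, 1 = Adequate, 2 = Excellent
    hallucination: int     # 0 = Significant, 1 = Minor, 2 = None

    def __post_init__(self) -> None:
        for name in ("response_quality", "context_quality", "hallucination"):
            if getattr(self, name) not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2")

    @property
    def total(self) -> int:
        """Unweighted total, 0-6; higher is better on all three criteria."""
        return self.response_quality + self.context_quality + self.hallucination
```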
Automated Evaluation Process
The automated evaluation process uses LLMs as judges to assess agent performance:
Configure Evaluation Parameters
- Which LLM will serve as the evaluator
- Which LLM will serve as the evaluator
- Evaluation frequency (daily, weekly, on-demand)
- Evaluation criteria weighting (a configuration sketch follows this list)
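As a sketch of how these parameters might be captured together, assuming the three 0-2 criterion scores are combined as a weighted average; the `EvaluationConfig` name, the default weights, and the judge model string are placeholders, not Prisme.ai settings:

```python
from dataclasses import dataclass, field

# Hypothetical configuration object; field names mirror the parameters
# listed above, not an actual Prisme.ai schema.
@dataclass
class EvaluationConfig:
    judge_model: str = "gpt-4o"  # which LLM serves as evaluator (placeholder name)
    frequency: str = "weekly"    # "daily" | "weekly" | "on-demand"
    weights: dict[str, float] = field(default_factory=lambda: {
        "response_quality": 0.4,
        "context_quality": 0.3,
        "hallucination": 0.3,
    })

    def weighted_score(self, scores: dict[str, int]) -> float:
        """Combine 0-2 criterion scores into a single figure in [0, 1]."""
        return sum(self.weights[k] * scores[k] / 2 for k in self.weights)

config = EvaluationConfig()
print(config.weighted_score(
    {"response_quality": 2, "context_quality": 1, "hallucination": 2}
))  # 0.4*1.0 + 0.3*0.5 + 0.3*1.0 ≈ 0.85
```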
Review Results
- Overall performance scores
- Performance trends over time
- Breakdowns by question type
- Detailed analysis of retrieved contexts (a score roll-up sketch follows this list)
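The roll-up below illustrates the kind of breakdown this review step implies: averaging weighted scores per question type so weak areas stand out. The input shape is an assumption:

```python
from collections import defaultdict
from statistics import mean

# `results` is assumed to be (question_type, weighted_score) pairs,
# e.g. the output of the weighted scoring sketched earlier.
def breakdown_by_type(results: list[tuple[str, float]]) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for question_type, score in results:
        buckets[question_type].append(score)
    return {qtype: round(mean(scores), 3) for qtype, scores in buckets.items()}

print(breakdown_by_type([
    ("simple", 0.95), ("simple", 0.90),
    ("complex", 0.55), ("complex", 0.70),
]))  # {'simple': 0.925, 'complex': 0.625}
```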
Human-in-the-Loop Evaluation
Combine automated testing with human expertise for comprehensive quality control. Human reviewers can:
- Review and override automated evaluation scores
- Provide qualitative feedback on responses
- Identify subtle issues that automated systems miss
- Add new test questions based on emerging needs
- Validate context quality and relevance (a sketch of the override rule follows this list)
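The override behavior reduces to a simple rule: a human score, when present, replaces the automated one. A sketch of that rule, not platform code:

```python
from typing import Optional

def effective_score(automated: int, human_override: Optional[int] = None) -> int:
    """Return the human score when a reviewer provided one, else the automated score."""
    if human_override is not None:
        if human_override not in (0, 1, 2):
            raise ValueError("override must use the 0/1/2 scale")
        return human_override
    return automated

assert effective_score(2) == 2                     # no reviewer input
assert effective_score(2, human_override=1) == 1   # reviewer downgrades
```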
Custom Evaluation with Webhooks
For specialized evaluation needs, you can implement custom processes using Webhooks and AI Builder:
Implement Custom Evaluation Logic
- Domain-specific quality metrics
- Domain-specific quality metrics
- Compliance and regulatory checks
- Industry terminology validation
- Integration with existing quality systems (an endpoint sketch follows this list)
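As an illustration, a custom endpoint for a compliance-style check might look like the Flask sketch below. The payload fields, the forbidden-term list, and the response shape are all assumptions, not the documented Prisme.ai webhook contract:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical forbidden-term list standing in for a real compliance rule set.
FORBIDDEN_TERMS = {"guaranteed returns", "medical diagnosis"}

@app.route("/evaluate", methods=["POST"])
def evaluate():
    payload = request.get_json(force=True)
    answer = payload.get("answer", "").lower()
    violations = sorted(term for term in FORBIDDEN_TERMS if term in answer)
    return jsonify({
        "compliant": not violations,
        "violations": violations,
        "score": 0 if violations else 2,  # reuse the 0/1/2 scale
    })

if __name__ == "__main__":
    app.run(port=8080)
```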
Strategic Benefits of Testing
Comprehensive testing delivers significant benefits beyond simple quality control:
Monitor Data Source Changes
Detect when changes to underlying data sources affect response quality.
This allows you to:
- Prevent regressions when content is updated
- Identify when knowledge gaps emerge
- Maintain consistency across content updates (a regression-check sketch follows this list)
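A minimal regression gate could compare per-question scores from before and after a content update, as sketched below; the threshold and data shapes are placeholders to tune against your own test set:

```python
# Flag questions whose score dropped by more than `threshold` after an update.
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.15,  # arbitrary placeholder
) -> list[str]:
    return [
        question
        for question, before in baseline.items()
        if question in current and before - current[question] > threshold
    ]

baseline = {"return policy?": 0.90, "support contact?": 0.80}
current = {"return policy?": 0.60, "support contact?": 0.85}
print(find_regressions(baseline, current))  # ['return policy?']
```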
Optimize LLM Selection
Evaluate performance across different LLM providers and models.
This enables you to:
- Select more cost-efficient models
- Reduce energy consumption
- Use specialized or self-hosted models when appropriate
- Make data-driven model migration decisions (a cost-comparison sketch follows this list)
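One simple data-driven comparison is cost per quality point on the same test set, sketched below with made-up model names and prices:

```python
# Candidate models evaluated on the same test set; figures are invented.
candidates = {
    "model-large": {"avg_score": 0.92, "usd_per_1k_runs": 12.00},
    "model-small": {"avg_score": 0.88, "usd_per_1k_runs": 1.50},
}

for name, stats in candidates.items():
    cost_per_point = stats["usd_per_1k_runs"] / stats["avg_score"]
    print(f"{name}: {cost_per_point:.2f} USD per quality point per 1k runs")
# A small quality drop can be worth an order-of-magnitude cost saving.
```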
Engage Business Stakeholders
Foster ownership of content quality among domain experts.
This helps to:
- Demonstrate the impact of quality source material
- Create accountability for knowledge accuracy
- Build trust in AI system outputs
- Drive continuous content improvement
Establish Tech-Business Alignment
Create a shared understanding of performance metrics and goals.
This leads to:
- Clear performance contracts between teams
- Shared optimization targets
- Better resource allocation
- Transparent communication about capabilities
Testing Methodology: Start Simple
We recommend an iterative testing approach that builds from foundational tests to more complex scenarios:
Initial Test Set (15 Questions)
Start with a manageable set of diverse test cases (a versionable sketch of this set follows the list):
- 5 Simple Questions, for example:
  - “What is our company’s return policy?”
  - “Who is the contact person for technical support?”
  - “What are the operating hours for customer service?”
- 5 Moderate Questions
- 5 Complex Questions
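Keeping the starter set as plain, versionable data makes it easy to grow and re-run. In the sketch below, the moderate and complex questions are invented examples:

```python
# Illustrative starter test set, tagged by difficulty.
TEST_SET = [
    {"difficulty": "simple",   "question": "What is our company's return policy?"},
    {"difficulty": "simple",   "question": "Who is the contact person for technical support?"},
    {"difficulty": "simple",   "question": "What are the operating hours for customer service?"},
    {"difficulty": "moderate", "question": "How does the return policy differ for sale items?"},
    {"difficulty": "complex",  "question": "Compare the warranty terms across our product lines."},
    # ...extend to 5 questions per difficulty level
]

counts: dict[str, int] = {}
for case in TEST_SET:
    counts[case["difficulty"]] = counts.get(case["difficulty"], 0) + 1
print(counts)  # {'simple': 3, 'moderate': 1, 'complex': 1}; target is 5 of each
```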
Iterative Optimization
After initial testing, systematically adjust and retest to improve performance:
Adjust LLM Parameters
- Prompt engineering adjustments
- Temperature and creativity settings
- Different models or model versions (a parameter-sweep sketch follows this list)
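A small parameter sweep, sketched below, re-runs the test set across a grid of settings. `run_test_set` is a hypothetical stand-in for whatever executes your evaluation, stubbed here so the example runs:

```python
import itertools

def run_test_set(model: str, temperature: float) -> float:
    # Stub: a real implementation would re-run the evaluation and
    # return the average weighted score for this configuration.
    return 0.9 - 0.1 * temperature

grid = itertools.product(["model-a", "model-b"], [0.0, 0.3, 0.7])
results = {(model, temp): run_test_set(model, temp) for model, temp in grid}
best = max(results, key=results.get)
print(best, results[best])  # ('model-a', 0.0) 0.9
```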
Refine RAG Configuration
- Chunking strategies
- Indexing methods
- Retrieval mechanisms
- Context handling (a chunking sketch follows this list)
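As one example from this list, fixed-size chunking with overlap is a common starting strategy; the sizes below are placeholders to tune against your retrieval scores:

```python
# Split text into overlapping fixed-size chunks.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "x" * 1200
print([len(c) for c in chunk_text(doc)])  # [500, 500, 300]
```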
Integrate Tools
- Calculators for numerical questions
- Structured data tools for comparisons
- Visualization tools for complex data (a tool-routing sketch follows this list)
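A toy routing rule, sketched below, shows the idea behind the first item: send purely arithmetic questions to a calculator instead of the bare model. Real agents use richer tool-selection logic:

```python
import re

def route(question: str) -> str:
    # Match questions like "What is 12 * (3 + 4)?" that are pure arithmetic.
    expr = re.fullmatch(r"\s*what is ([\d\s+\-*/().]+)\?\s*", question.lower())
    if expr:
        # eval() on an arithmetic-only string; fine for a sketch,
        # not for production input handling.
        return str(eval(expr.group(1)))
    return "answer with the LLM and retrieved context"

print(route("What is 12 * (3 + 4)?"))       # 84
print(route("What is our return policy?"))  # falls through to the LLM
```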
Best Practices
Test Creation
- Base test questions on actual user queries when possible
- Include a mix of simple, moderate, and complex questions
- Create test cases that cover all key knowledge domains
- Update test sets as user needs and content evolve
- Include edge cases and potential failure scenarios
Evaluation Approach
- Use automated evaluation for regular monitoring
- Incorporate human review for high-stakes applications
- Test both positive scenarios (what the agent should do) and negative scenarios (what it shouldn’t do)
- Establish clear evaluation criteria before testing
- Compare performance across different agent configurations
Continuous Improvement
- Schedule regular re-evaluation of agent performance
- Analyze patterns in low-scoring responses
- Document configuration changes and their impact
- Establish feedback loops with end users
- Create a prioritization framework for addressing issues
Team Collaboration
- Include both technical and business stakeholders in test creation
- Share testing results transparently across teams
- Establish clear ownership for different aspects of quality
- Create shared performance goals and targets
- Celebrate improvements in agent quality
