# Agent Testing

> Validate and improve your knowledge-based agents through comprehensive testing approaches

Creating effective AI agents requires thorough testing to ensure they provide accurate, helpful, and appropriate responses. Prisme.ai provides testing capabilities to validate your agents before deployment and continuously improve them over time.

## Testing Approaches

Prisme.ai supports multiple testing methodologies to ensure your agents meet your organization's standards:

* Direct interaction with the agent to assess its responses.
* AI-powered evaluations that assess agent responses based on predefined criteria.
* Automated testing combined with human review for comprehensive evaluation.
* Specialized evaluation processes implemented via Webhooks and AI Builder.

## Evaluation Framework

Prisme.ai uses a straightforward evaluation system that makes it easy to assess agent performance:

Assesses how well the agent answers the question

Score: 0 (Poor), 1 (Adequate), 2 (Excellent)

Evaluates how well the agent retrieved relevant information

Score: 0 (Poor), 1 (Adequate), 2 (Excellent)

Identifies if the agent made up information

Score: 0 (Significant), 1 (Minor), 2 (None)
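
The three criteria above can be captured in a small data structure. The following is a minimal sketch, not a Prisme.ai type: the class name `EvaluationScore` and its field names are assumptions chosen for illustration. It only shows how the 0 to 2 scale and its labels might be validated and rendered.

```python
from dataclasses import dataclass

# Labels for the three-point scales described above.
QUALITY_LABELS = {0: "Poor", 1: "Adequate", 2: "Excellent"}
HALLUCINATION_LABELS = {0: "Significant", 1: "Minor", 2: "None"}


@dataclass(frozen=True)
class EvaluationScore:
    """Hypothetical container for one evaluated response (not a Prisme.ai type)."""

    answer: int         # how well the agent answers the question
    context: int        # how well the agent retrieved relevant information
    hallucination: int  # 2 means no made-up information was found

    def __post_init__(self):
        # Reject anything outside the three-point scale.
        for name in ("answer", "context", "hallucination"):
            value = getattr(self, name)
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")

    def describe(self) -> str:
        """Render the numeric scores with their human-readable labels."""
        return (
            f"answer={QUALITY_LABELS[self.answer]}, "
            f"context={QUALITY_LABELS[self.context]}, "
            f"hallucination={HALLUCINATION_LABELS[self.hallucination]}"
        )


score = EvaluationScore(answer=2, context=1, hallucination=2)
print(score.describe())
```

Note that the hallucination scale runs in the same direction as the other two: a higher number is always better.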

This simple three-point scale makes evaluation straightforward while providing meaningful insights into agent performance.

## Automated Evaluation Process

The automated evaluation process uses LLMs as judges to assess agent performance:

1. Develop a set of representative questions that users might ask your agent.
2. Set up the evaluation process by selecting:
   * Which LLM will serve as the evaluator
   * Evaluation frequency (daily, weekly, on-demand)
   * Evaluation criteria weighting
3. Execute the evaluation process, either automatically on schedule or manually.
4. Analyze the evaluation scores and trends over time. The evaluation dashboard shows:
   * Overall performance scores
   * Performance trends over time
   * Breakdowns by question type
   * Detailed analysis of retrieved contexts
5. Export test sets and results for documentation, sharing, or further analysis.

## Human-in-the-Loop Evaluation

Combine automated testing with human expertise for comprehensive quality control. Human reviewers can:

* Review and override automated evaluation scores
* Provide qualitative feedback on responses
* Identify subtle issues that automated systems miss
* Add new test questions based on emerging needs
* Validate context quality and relevance

## Custom Evaluation with Webhooks

For specialized evaluation needs, you can implement custom processes using Webhooks and AI Builder:

1. Set up a Webhook URL that will listen for test events.

   ```json Example Webhook Configuration
   {
     "webhook_url": "https://your-custom-evaluator.com/api/evaluate",
     "authentication": {
       "type": "bearer_token",
       "token": "${ENV_SECRET_TOKEN}"
     }
   }
   ```

2. Create evaluation logic that processes test results according to your specific criteria. Custom evaluations can include:
   * Domain-specific quality metrics
   * Compliance and regulatory checks
   * Industry terminology validation
   * Integration with existing quality systems
3. Send evaluation results back to Prisme.ai in the standard scoring format.

```json Example Response Format
{
  "score": 2,
  "context": 1,
  "analysis": "Response was accurate but missing some context about recent policy changes.",
  "custom_metrics": {
    "compliance_score": 0.95,
    "terminology_accuracy": 0.87
  }
}
```

## Strategic Benefits of Testing

Comprehensive testing delivers significant benefits beyond simple quality control:

Detect when changes to underlying data sources affect response quality.

This allows you to:

* Prevent regressions when content is updated
* Identify when knowledge gaps emerge
* Maintain consistency across content updates
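
One way to detect such regressions is to compare per-question scores from two evaluation runs. The sketch below assumes you have exported two runs as plain question-to-score mappings; the function name `find_regressions` and the sample scores are illustrative only and do not come from Prisme.ai.

```python
def find_regressions(
    baseline: dict[str, int], current: dict[str, int]
) -> list[tuple[str, int, int]]:
    """Return (question, old_score, new_score) for each question whose score dropped."""
    regressions = []
    for question, old_score in baseline.items():
        new_score = current.get(question)
        # Questions missing from the current run are skipped, not counted as drops.
        if new_score is not None and new_score < old_score:
            regressions.append((question, old_score, new_score))
    return regressions


# Illustrative scores on the 0-2 scale, before and after a content update.
baseline = {
    "What is our company's return policy?": 2,
    "How do our Standard and Premium plans differ?": 2,
    "Who is the contact person for technical support?": 1,
}
current = {
    "What is our company's return policy?": 2,
    "How do our Standard and Premium plans differ?": 1,
    "Who is the contact person for technical support?": 1,
}

for question, old, new in find_regressions(baseline, current):
    print(f"Regression: {question!r} dropped from {old} to {new}")
```

Running the same comparison after every content update gives an early signal that a source change has degraded an answer.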

Evaluate performance across different LLM providers and models.

This enables you to:

* Select more cost-efficient models
* Reduce energy consumption
* Use specialized or self-hosted models when appropriate
* Make data-driven model migration decisions
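
A model comparison of this kind can be as simple as averaging scores from the same test set across candidates. The following is a rough sketch under stated assumptions: `model-a` and `model-b` are placeholder names, the scores are made up for illustration, and a real decision would also weigh cost and latency.

```python
from statistics import mean


def summarize(results: dict[str, list[int]]) -> dict[str, float]:
    """Average the 0-2 scores each model received on the same test set."""
    return {model: round(mean(scores), 2) for model, scores in results.items()}


# Placeholder model names with illustrative scores on five shared questions.
results = {
    "model-a": [2, 2, 1, 2, 1],
    "model-b": [2, 1, 1, 2, 1],
}

summary = summarize(results)
best = max(summary, key=summary.get)
print(summary, "best:", best)
```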

Foster ownership of content quality among domain experts.

This helps to:

* Demonstrate the impact of quality source material
* Create accountability for knowledge accuracy
* Build trust in AI system outputs
* Drive continuous content improvement

Create a shared understanding of performance metrics and goals.

This leads to:

* Clear performance contracts between teams
* Shared optimization targets
* Better resource allocation
* Transparent communication about capabilities

## Testing Methodology: Start Simple

We recommend an iterative testing approach that builds from foundational tests to more complex scenarios:

### Initial Test Set (15 Questions)

Start with a manageable set of diverse test cases:

**Simple questions**: Basic factual queries with straightforward answers.

**Examples**:

* "What is our company's return policy?"
* "Who is the contact person for technical support?"
* "What are the operating hours for customer service?"

**Purpose**: Establish a baseline for core knowledge retrieval.

**Moderate questions**: Queries requiring some synthesis or comparison.

**Examples**:

* "How do our Standard and Premium plans differ?"
* "What steps should I take if a customer requests a refund after 30 days?"
* "Explain the main benefits of our latest product update."

**Purpose**: Test the agent's ability to connect related information.

**Complex questions**: Multi-part or nuanced queries requiring deeper understanding.

**Examples**:

* "What are the tradeoffs between our cloud and on-premises deployment options for enterprise customers with strict data residency requirements?"
* "How have our sustainability initiatives impacted our manufacturing costs and product pricing over the past three years?"
* "What are the recommended approaches for implementing our API in a high-throughput environment with legacy system integration?"

**Purpose**: Challenge the agent's advanced capabilities.

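
To keep such a test set under version control, the three groups of example questions above could be stored as tagged records. This is a minimal sketch: the `EvalQuestion` class and the `difficulty` tags are assumptions made for illustration, not a Prisme.ai import format, and only the nine example questions from this page are included.

```python
from collections import Counter
from dataclasses import dataclass

DIFFICULTIES = ("simple", "moderate", "complex")


@dataclass(frozen=True)
class EvalQuestion:
    """Hypothetical record for one test question."""

    question: str
    difficulty: str  # one of DIFFICULTIES


TEST_SET = [
    EvalQuestion("What is our company's return policy?", "simple"),
    EvalQuestion("Who is the contact person for technical support?", "simple"),
    EvalQuestion("What are the operating hours for customer service?", "simple"),
    EvalQuestion("How do our Standard and Premium plans differ?", "moderate"),
    EvalQuestion(
        "What steps should I take if a customer requests a refund after 30 days?",
        "moderate",
    ),
    EvalQuestion("Explain the main benefits of our latest product update.", "moderate"),
    EvalQuestion(
        "What are the tradeoffs between our cloud and on-premises deployment options "
        "for enterprise customers with strict data residency requirements?",
        "complex",
    ),
    EvalQuestion(
        "How have our sustainability initiatives impacted our manufacturing costs "
        "and product pricing over the past three years?",
        "complex",
    ),
    EvalQuestion(
        "What are the recommended approaches for implementing our API in a "
        "high-throughput environment with legacy system integration?",
        "complex",
    ),
]


def coverage(test_set: list[EvalQuestion]) -> Counter:
    """Count how many questions fall into each difficulty group."""
    return Counter(case.difficulty for case in test_set)


counts = coverage(TEST_SET)
missing = [tier for tier in DIFFICULTIES if counts[tier] == 0]
print(dict(counts), "missing groups:", missing)
```

A coverage check like this makes it easy to see when one group is underrepresented as the set grows toward the recommended 15 questions.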
### Iterative Optimization

After initial testing, systematically adjust and retest to improve performance:

1. Experiment with:
   * Prompt engineering adjustments
   * Temperature and creativity settings
   * Different models or model versions
2. Optimize how information is processed and retrieved:
   * Chunking strategies
   * Indexing methods
   * Retrieval mechanisms
   * Context handling
3. Add specialized capabilities where needed:
   * Calculators for numerical questions
   * Structured data tools for comparisons
   * Visualization tools for complex data
4. Once performance is optimized, increase test coverage:
   * Add more edge cases
   * Include newly discovered user questions
   * Create tests for specific user personas

## Best Practices

* Base test questions on actual user queries when possible
* Include a mix of simple, moderate, and complex questions
* Create test cases that cover all key knowledge domains
* Update test sets as user needs and content evolve
* Include edge cases and potential failure scenarios
* Use automated evaluation for regular monitoring
* Incorporate human review for high-stakes applications
* Test both positive scenarios (what the agent should do) and negative scenarios (what it shouldn't do)
* Establish clear evaluation criteria before testing
* Compare performance across different agent configurations
* Schedule regular re-evaluation of agent performance
* Analyze patterns in low-scoring responses
* Document configuration changes and their impact
* Establish feedback loops with end users
* Create a prioritization framework for addressing issues
* Include both technical and business stakeholders in test creation
* Share testing results transparently across teams
* Establish clear ownership for different aspects of quality
* Create shared performance goals and targets
* Celebrate improvements in agent quality

## Next Steps

* Learn how to optimize retrieval and generation settings
* Enhance your agent with specialized capabilities
* Explore sophisticated RAG architectures
* Monitor agent performance metrics