# Agent Testing

> Validate and improve your knowledge-based agents through comprehensive testing approaches

Effective AI agents need thorough testing to confirm that their responses are accurate, helpful, and appropriate. Prisme.ai provides testing capabilities for validating agents before deployment and improving them continuously afterward.

## Testing Approaches

Prisme.ai supports multiple testing methodologies to ensure your agents meet your organization's standards:

<Tabs>
  <Tab title="Manual Testing">
    Direct interaction with the agent to assess its responses.

    <Properties>
      <Property name="Strengths" value="Intuitive, flexible, exploratory" />

      <Property name="Best For" value="Initial validation, unexpected scenarios, subjective assessments" />

      <Property name="Limitations" value="Time-consuming, not easily repeatable, potential inconsistency" />
    </Properties>
  </Tab>

  <Tab title="Automated Evaluation">
    AI-powered evaluations that assess agent responses based on predefined criteria.

    <Properties>
      <Property name="Strengths" value="Consistent, scalable, objective" />

      <Property name="Best For" value="Regression testing, continuous evaluation, large-scale testing" />

      <Property name="Limitations" value="Requires proper configuration, may miss subjective nuances" />
    </Properties>
  </Tab>

  <Tab title="Human-in-the-Loop">
    Combines automated testing with human review for comprehensive evaluation.

    <Properties>
      <Property name="Strengths" value="High accuracy, captures subjective aspects, expert insight" />

      <Property name="Best For" value="Critical applications, sensitive content, complex evaluations" />

      <Property name="Limitations" value="Resource intensive, potential for inconsistency between reviewers" />
    </Properties>
  </Tab>

  <Tab title="Custom Evaluation">
    Specialized evaluation processes implemented via Webhooks and AI Builder.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/prismeai-legacy/images/custom-evaluation.png" alt="Custom Evaluation Interface" />
    </Frame>

    <Properties>
      <Property name="Strengths" value="Highly customizable, domain-specific metrics, integration with existing systems" />

      <Property name="Best For" value="Industry-specific requirements, specialized metrics, integration with quality systems" />

      <Property name="Limitations" value="Requires technical implementation, additional maintenance" />
    </Properties>
  </Tab>
</Tabs>

## Evaluation Framework

Prisme.ai scores each agent response on three criteria, each rated from 0 to 2:

<CardGroup cols={3}>
  <Card title="Response Quality" icon="message">
    Assesses how well the agent answers the question
    <p><strong>Score:</strong> 0 (Poor), 1 (Adequate), 2 (Excellent)</p>
  </Card>

  <Card title="Context Quality" icon="file-lines">
    Evaluates how well the agent retrieved relevant information
    <p><strong>Score:</strong> 0 (Poor), 1 (Adequate), 2 (Excellent)</p>
  </Card>

  <Card title="Hallucination Check" icon="brain">
    Identifies whether the agent made up information
    <p><strong>Score:</strong> 0 (Significant), 1 (Minor), 2 (None)</p>
  </Card>
</CardGroup>

Using the same three-point scale for every criterion keeps individual scores easy to assign and easy to compare across questions.
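As an illustration of the scale above, the following minimal Python sketch models a single evaluation record. The class and field names (`EvaluationScore`, `response_quality`, and so on) are hypothetical and are not part of the Prisme.ai API; only the 0 to 2 ranges come from the cards above. Summing the three criteria is one possible aggregation, chosen here for simplicity.

```python
from dataclasses import dataclass

# Hypothetical record type; names are illustrative, not a Prisme.ai API.
VALID_SCORES = (0, 1, 2)


@dataclass(frozen=True)
class EvaluationScore:
    response_quality: int  # 0 (Poor), 1 (Adequate), 2 (Excellent)
    context_quality: int   # 0 (Poor), 1 (Adequate), 2 (Excellent)
    hallucination: int     # 0 (Significant), 1 (Minor), 2 (None)

    def __post_init__(self) -> None:
        # Reject anything outside the three-point scale.
        for name in ("response_quality", "context_quality", "hallucination"):
            value = getattr(self, name)
            if value not in VALID_SCORES:
                raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")

    def total(self) -> int:
        """Sum of the three criteria, from 0 (worst) to 6 (best)."""
        return self.response_quality + self.context_quality + self.hallucination


score = EvaluationScore(response_quality=2, context_quality=1, hallucination=2)
print(score.total())  # 5
```

Note that for the hallucination criterion a higher score is better (2 means no hallucination), so all three criteria point in the same direction and can be summed or averaged.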

## Automated Evaluation Process

The automated evaluation process uses LLMs as judges to assess agent performance:

<Steps>
  <Step title="Create Test Questions">
    Develop a set of representative questions that users might ask your agent.
  </Step>

  <Step title="Configure Evaluation Parameters">
    Set up the evaluation process by selecting:

    * Which LLM will serve as the evaluator
    * Evaluation frequency (daily, weekly, on-demand)
    * Evaluation criteria weighting
  </Step>

  <Step title="Run Evaluations">
    Execute the evaluation process, either automatically on schedule or manually.
  </Step>

  <Step title="Review Results">
    Analyze the evaluation scores and trends over time.

    The evaluation dashboard shows:

    * Overall performance scores
    * Performance trends over time
    * Breakdowns by question type
    * Detailed analysis of retrieved contexts
  </Step>

  <Step title="Export and Share">
    Export test sets and results for documentation, sharing, or further analysis.
  </Step>
</Steps>
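The steps above can be sketched as a small evaluation loop. This is a minimal illustration under stated assumptions, not how Prisme.ai implements its evaluator: `run_evaluation`, `fake_agent`, and `fake_judge` are hypothetical names, the criterion keys are illustrative, and the judge is a stub where a real setup would prompt an LLM.

```python
from statistics import mean
from typing import Callable, Dict, List

# Assumed shapes: an agent maps a question to an answer, and a judge maps
# (question, answer) to a 0-2 score per criterion.
Agent = Callable[[str], str]
Judge = Callable[[str, str], Dict[str, int]]


def run_evaluation(questions: List[str], agent: Agent, judge: Judge) -> Dict[str, float]:
    """Ask the agent each question, score the answer, and average per criterion."""
    per_criterion: Dict[str, List[int]] = {}
    for question in questions:
        answer = agent(question)
        for criterion, value in judge(question, answer).items():
            per_criterion.setdefault(criterion, []).append(value)
    return {criterion: mean(values) for criterion, values in per_criterion.items()}


# Stand-ins so the sketch runs without any external service.
def fake_agent(question: str) -> str:
    return "Returns are accepted within 30 days." if "return" in question else ""


def fake_judge(question: str, answer: str) -> Dict[str, int]:
    # A real judge would prompt an LLM; this stub only checks for an empty answer.
    quality = 2 if answer else 0
    return {"response_quality": quality, "context_quality": quality, "hallucination": 2}


questions = [
    "What is our company's return policy?",
    "Who is the contact person for technical support?",
]
summary = run_evaluation(questions, fake_agent, fake_judge)
print(summary)
```

Passing the agent and judge as plain callables keeps the loop independent of any particular model, which makes it easy to swap the evaluator LLM between runs.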

## Human-in-the-Loop Evaluation

Combine automated testing with human expertise for comprehensive quality control.

Human reviewers can:

* Review and override automated evaluation scores
* Provide qualitative feedback on responses
* Identify subtle issues that automated systems miss
* Add new test questions based on emerging needs
* Validate context quality and relevance
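
One way to represent a reviewer override is to store the human score alongside the automated one and let it take precedence when present. The sketch below assumes this design; `ReviewedResult` and its fields are hypothetical names, not a Prisme.ai data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedResult:
    question: str
    automated_score: int               # 0-2, produced by the automated evaluation
    human_score: Optional[int] = None  # set only when a reviewer overrides
    reviewer_comment: str = ""         # qualitative feedback from the reviewer

    @property
    def final_score(self) -> int:
        """The human score wins whenever a reviewer has supplied one."""
        return self.automated_score if self.human_score is None else self.human_score


result = ReviewedResult("How do our Standard and Premium plans differ?", automated_score=2)
result.human_score = 1
result.reviewer_comment = "Answer omits a relevant detail."
print(result.final_score)  # 1
```

Keeping both scores, rather than overwriting the automated one, preserves a record of where human and automated judgments disagree.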

## Custom Evaluation with Webhooks

For specialized evaluation needs, you can implement custom processes using Webhooks and AI Builder:

<Steps>
  <Step title="Configure Webhook Endpoint">
    Set up a Webhook endpoint that listens for test events.

    <CodeGroup>
      ```json Example Webhook Configuration
      {
        "webhook_url": "https://your-custom-evaluator.com/api/evaluate",
        "authentication": {
          "type": "bearer_token",
          "token": "${ENV_SECRET_TOKEN}"
        }
      }
      ```
    </CodeGroup>
  </Step>

  <Step title="Implement Custom Evaluation Logic">
    Create evaluation logic that processes test results according to your specific criteria.

    Custom evaluations can include:

    * Domain-specific quality metrics
    * Compliance and regulatory checks
    * Industry terminology validation
    * Integration with existing quality systems
  </Step>

  <Step title="Return Standardized Results">
    Send evaluation results back to Prisme.ai in the standard scoring format.

    <CodeGroup>
      ```json Example Response Format
      {
        "score": 2,
        "context": 1,
        "analysis": "Response was accurate but missing some context about recent policy changes.",
        "custom_metrics": {
          "compliance_score": 0.95,
          "terminology_accuracy": 0.87
        }
      }
      ```
    </CodeGroup>
  </Step>
</Steps>
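A custom evaluator along these lines could be sketched with Python's standard library. Several details here are assumptions, because the request payload is not specified above: the incoming field names `answer` and `context`, the `EXPECTED_TOKEN` secret, and the `REQUIRED_TERMS` check are all hypothetical. Only the response keys (`score`, `context`, `analysis`, `custom_metrics`) follow the example response format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "replace-me"           # assumption: matches the configured bearer token
REQUIRED_TERMS = ("refund", "30 days")  # hypothetical domain terminology to check


def evaluate(payload: dict) -> dict:
    """Score one test result; the input fields "answer" and "context" are assumed."""
    answer = str(payload.get("answer", "")).lower()
    hits = sum(term in answer for term in REQUIRED_TERMS)
    accuracy = hits / len(REQUIRED_TERMS)
    return {
        "score": 2 if accuracy == 1 else 1 if accuracy > 0 else 0,
        # Crude placeholder: only checks that some context was retrieved.
        "context": 1 if payload.get("context") else 0,
        "analysis": f"Matched {hits} of {len(REQUIRED_TERMS)} required terms.",
        "custom_metrics": {"terminology_accuracy": accuracy},
    }


class EvaluationHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Reject requests that do not carry the expected bearer token.
        if self.headers.get("Authorization") != f"Bearer {EXPECTED_TOKEN}":
            self.send_error(401, "Invalid bearer token")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            if not isinstance(payload, dict):
                raise ValueError("payload must be a JSON object")
        except ValueError:
            self.send_error(400, "Body must be a JSON object")
            return
        body = json.dumps(evaluate(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the endpoint; call serve() to run it locally."""
    HTTPServer(("", port), EvaluationHandler).serve_forever()
```

Keeping the scoring logic in a plain `evaluate` function, separate from the HTTP handler, lets you unit-test the criteria without starting a server.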

## Strategic Benefits of Testing

Comprehensive testing delivers significant benefits beyond simple quality control:

<CardGroup cols={2}>
  <Card title="Monitor Data Source Changes" icon="database">
    <p>Detect when changes to underlying data sources affect response quality.</p>
    <p>This allows you to:</p>

    <ul>
      <li>Prevent regressions when content is updated</li>
      <li>Identify when knowledge gaps emerge</li>
      <li>Maintain consistency across content updates</li>
    </ul>
  </Card>

  <Card title="Optimize LLM Selection" icon="microchip">
    <p>Evaluate performance across different LLM providers and models.</p>
    <p>This enables you to:</p>

    <ul>
      <li>Select more cost-efficient models</li>
      <li>Reduce energy consumption</li>
      <li>Use specialized or self-hosted models when appropriate</li>
      <li>Make data-driven model migration decisions</li>
    </ul>
  </Card>

  <Card title="Engage Business Stakeholders" icon="users">
    <p>Foster ownership of content quality among domain experts.</p>
    <p>This helps to:</p>

    <ul>
      <li>Demonstrate the impact of quality source material</li>
      <li>Create accountability for knowledge accuracy</li>
      <li>Build trust in AI system outputs</li>
      <li>Drive continuous content improvement</li>
    </ul>
  </Card>

  <Card title="Establish Tech-Business Alignment" icon="handshake">
    <p>Create a shared understanding of performance metrics and goals.</p>
    <p>This leads to:</p>

    <ul>
      <li>Clear performance contracts between teams</li>
      <li>Shared optimization targets</li>
      <li>Better resource allocation</li>
      <li>Transparent communication about capabilities</li>
    </ul>
  </Card>
</CardGroup>

## Testing Methodology: Start Simple

We recommend an iterative testing approach that builds from foundational tests to more complex scenarios:

### Initial Test Set (15 Questions)

Start with a manageable set of diverse test cases:

<Tabs>
  <Tab title="5 Simple Questions">
    Basic factual queries with straightforward answers.

    **Examples**:

    * "What is our company's return policy?"
    * "Who is the contact person for technical support?"
    * "What are the operating hours for customer service?"

    **Purpose**: Establish a baseline for core knowledge retrieval.
  </Tab>

  <Tab title="5 Moderate Questions">
    Queries requiring some synthesis or comparison.

    **Examples**:

    * "How do our Standard and Premium plans differ?"
    * "What steps should I take if a customer requests a refund after 30 days?"
    * "Explain the main benefits of our latest product update."

    **Purpose**: Test the agent's ability to connect related information.
  </Tab>

  <Tab title="5 Complex Questions">
    Multi-part or nuanced queries requiring deeper understanding.

    **Examples**:

    * "What are the tradeoffs between our cloud and on-premises deployment options for enterprise customers with strict data residency requirements?"
    * "How have our sustainability initiatives impacted our manufacturing costs and product pricing over the past three years?"
    * "What are the recommended approaches for implementing our API in a high-throughput environment with legacy system integration?"

    **Purpose**: Challenge the agent's advanced capabilities.
  </Tab>
</Tabs>
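
To keep the initial set balanced, it can help to tag each question with its tier and check the counts. The sketch below is one possible layout: the `difficulty` and `question` keys and the `missing_questions` helper are hypothetical, and the three questions are taken from the examples above.

```python
from collections import Counter

TARGET_PER_TIER = 5  # five questions per tier, 15 in total
TIERS = ("simple", "moderate", "complex")

# One example per tier; a full set would list five of each.
test_set = [
    {"difficulty": "simple", "question": "What is our company's return policy?"},
    {"difficulty": "moderate", "question": "How do our Standard and Premium plans differ?"},
    {
        "difficulty": "complex",
        "question": "What are the recommended approaches for implementing our API "
        "in a high-throughput environment with legacy system integration?",
    },
]


def missing_questions(cases: list) -> dict:
    """Return how many questions each tier still needs to reach the target."""
    counts = Counter(case["difficulty"] for case in cases)
    return {tier: max(TARGET_PER_TIER - counts[tier], 0) for tier in TIERS}


print(missing_questions(test_set))  # {'simple': 4, 'moderate': 4, 'complex': 4}
```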

### Iterative Optimization

After initial testing, systematically adjust and retest to improve performance:

<Steps>
  <Step title="Adjust LLM Parameters">
    Experiment with:

    * Prompt engineering adjustments
    * Temperature and creativity settings
    * Different models or model versions
  </Step>

  <Step title="Refine RAG Configuration">
    Optimize how information is processed and retrieved:

    * Chunking strategies
    * Indexing methods
    * Retrieval mechanisms
    * Context handling
  </Step>

  <Step title="Integrate Tools">
    Add specialized capabilities where needed:

    * Calculators for numerical questions
    * Structured data tools for comparisons
    * Visualization tools for complex data
  </Step>

  <Step title="Expand Test Set">
    Once performance is optimized, increase test coverage:

    * Add more edge cases
    * Include newly discovered user questions
    * Create tests for specific user personas
  </Step>
</Steps>
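
When retesting after a configuration change, comparing scores question by question shows whether the change helped overall and which questions regressed. The following sketch assumes each run is a mapping from a question identifier to its 0 to 2 score; `compare_runs` is a hypothetical helper, and the scores shown are invented purely for illustration.

```python
from statistics import mean


def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Per-question score delta (candidate minus baseline) for shared questions."""
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    return {
        "deltas": deltas,
        "mean_delta": mean(deltas.values()) if deltas else 0.0,
        "regressions": [q for q in shared if deltas[q] < 0],
    }


# Illustrative scores only: the same three questions under two configurations.
baseline = {"q1": 1, "q2": 2, "q3": 0}
candidate = {"q1": 2, "q2": 1, "q3": 1}

report = compare_runs(baseline, candidate)
print(report["regressions"])  # ['q2']
```

A positive mean delta with a non-empty regression list is a useful signal: the new configuration is better on average, but specific questions deserve a closer look before it is adopted.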

## Best Practices

<AccordionGroup>
  <Accordion title="Test Creation">
    * Base test questions on actual user queries when possible
    * Include a mix of simple, moderate, and complex questions
    * Create test cases that cover all key knowledge domains
    * Update test sets as user needs and content evolve
    * Include edge cases and potential failure scenarios
  </Accordion>

  <Accordion title="Evaluation Approach">
    * Use automated evaluation for regular monitoring
    * Incorporate human review for high-stakes applications
    * Test both positive scenarios (what the agent should do) and negative scenarios (what it shouldn't do)
    * Establish clear evaluation criteria before testing
    * Compare performance across different agent configurations
  </Accordion>

  <Accordion title="Continuous Improvement">
    * Schedule regular re-evaluation of agent performance
    * Analyze patterns in low-scoring responses
    * Document configuration changes and their impact
    * Establish feedback loops with end users
    * Create a prioritization framework for addressing issues
  </Accordion>

  <Accordion title="Team Collaboration">
    * Include both technical and business stakeholders in test creation
    * Share testing results transparently across teams
    * Establish clear ownership for different aspects of quality
    * Create shared performance goals and targets
    * Celebrate improvements in agent quality
  </Accordion>
</AccordionGroup>

## Next Steps

<CardGroup cols={2}>
  <Card title="RAG Configuration" icon="sliders" href="rag-configuration">
    Learn how to optimize retrieval and generation settings
  </Card>

  <Card title="Tools Integration" icon="screwdriver-wrench" href="tools-integration">
    Enhance your agent with specialized capabilities
  </Card>

  <Card title="Advanced RAG" icon="wand-sparkles" href="advanced-rag">
    Explore sophisticated RAG architectures
  </Card>

  <Card title="Analytics" icon="chart-line" href="analytics">
    Monitor agent performance metrics
  </Card>
</CardGroup>
