ASQI Engineer
ASQI (AI Solutions Quality Index) Engineer is a comprehensive framework for systematic testing and quality assurance of AI systems. Developed from Resaro’s experience bridging governance, technical and business requirements, ASQI Engineer enables rigorous evaluation of AI systems through containerized test packages, automated assessment, and durable execution workflows.
ASQI Engineer is in active development, and we welcome contributions: new test packages, shared score cards and test plans, and help defining common schemas to meet industry needs. Our initial release focuses on comprehensive chatbot testing with extensible foundations for broader AI system evaluation.
Key Features
DBOS-powered fault tolerance with automatic retry and recovery for reliable test execution.
Reproducible testing in isolated Docker environments with consistent, repeatable results.
Multi-system orchestration coordinates target, simulator, and evaluator systems in complex testing workflows.
Configurable score cards map technical metrics to business-relevant outcomes (a sketch follows this list).
Pydantic schemas with JSON Schema generation provide IDE integration and validation.
Separate validation, test execution, and evaluation phases for flexible CI/CD integration.
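To make the score card idea concrete, here is a minimal sketch of what a score card entry could look like. The key names (indicators, metric, assessment, thresholds) are illustrative assumptions rather than the actual schema; the Pydantic-generated JSON Schemas are the authoritative reference.

# Hypothetical score card sketch; key names are illustrative assumptions,
# not the actual ASQI schema. See the generated JSON Schemas for the real fields.
score_card_name: "Customer chatbot quality"
indicators:
  - name: "Toxicity under provocation"
    apply_to:
      test_name: "security_scan"        # assumed name of a test in the suite
    metric: "toxicity_rate"             # assumed metric emitted by the test container
    assessment:                         # technical metric mapped to a business outcome
      - outcome: "PASS"
        condition: "less_than"
        threshold: 0.01
      - outcome: "FAIL"
        condition: "greater_equal"
        threshold: 0.01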
LLM Testing
For our first release, we have introduced the llm_api system type and contributed 4 test packages for comprehensive LLM system testing. We have also open-sourced a draft ASQI score card for customer chatbots that provides mappings between technical metrics and business-relevant assessment criteria.
LLM Test Containers
Garak: Security vulnerability assessment with 40+ attack vectors and probes
DeepTeam: Red teaming library for adversarial robustness testing
TrustLLM: Comprehensive framework and benchmarks to evaluate trustworthiness of LLM systems
Resaro Chatbot Simulator: Persona- and scenario-based conversational testing with multi-turn dialogue simulation
The llm_api system type uses OpenAI-compatible API interfaces. Through LiteLLM integration, ASQI Engineer provides unified access to 100+ LLM providers including OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, and custom endpoints.
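As an illustration, a systems definition for an llm_api target might look something like the sketch below, assuming keys such as type, base_url, model, and api_key_env. These names are placeholders for illustration only; the generated JSON Schemas define the actual structure.

# Hypothetical systems-file sketch; key names are assumptions, not the actual schema.
systems:
  my_chatbot:
    type: "llm_api"                        # the system type introduced in this release
    base_url: "https://api.openai.com/v1"  # any OpenAI-compatible endpoint works
    model: "gpt-4o-mini"                   # routed through LiteLLM, so 100+ providers are usable
    api_key_env: "OPENAI_API_KEY"          # credential supplied via an environment variable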
Test Packages
Comprehensive vulnerability assessment with Garak (40+ security probes) and DeepTeam (advanced red teaming) frameworks.
Multi-turn dialogue testing with persona-based simulation and LLM-as-judge evaluation for realistic chatbot assessment.
Academic-grade evaluation across 6 trust dimensions using the TrustLLM framework.
Build domain-specific test containers with standardized interfaces and multi-system orchestration capabilities.
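Whichever packages you use, tests are composed into suites via YAML configuration (see config/suites/ in the repository). The sketch below is a hypothetical example; keys such as test_suite, image, systems_under_test, and params are illustrative assumptions, with the generated JSON Schemas as the authoritative reference.

# Hypothetical test suite sketch; key names are assumptions, not the actual schema.
suite_name: "basic_chatbot_checks"
test_suite:
  - name: "security_scan"
    image: "asqi/garak:latest"            # assumed container image name
    systems_under_test: ["my_chatbot"]    # references an entry in the systems file
    params:
      probes: ["promptinject", "dan"]     # assumed parameter selecting Garak probes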
Beta: ASQI Chatbot Quality Index
Warning
🚧 This is a draft quality index under active development.
ASQI Engineer includes a comprehensive draft quality index designed specifically for chatbot systems. This beta feature provides a standardized framework for evaluating chatbot quality across multiple dimensions that matter to businesses deploying conversational AI.
What is the ASQI Chatbot Quality Index?
The ASQI Chatbot Quality Index is a multi-dimensional assessment framework that evaluates a chatbot system's performance and risk handling across eight key areas:
Relevance: How relevant is the information provided by the chatbot?
Accuracy: How correct is the information provided by the chatbot?
Consistency: How consistently does the chatbot perform when users express the same intent using different words, styles, or structures?
Out-of-domain Handling: How well does the chatbot identify when users are asking for something it’s not designed to do?
Bias Mitigation: How effectively does the chatbot avoid biased, stereotypical, or discriminatory responses?
Toxicity Control: To what extent is offensive or toxic output controlled?
Competition Mention: How effectively does the chatbot avoid promoting competitors while maintaining appropriate responses when directly asked about market alternatives?
Jailbreaking Resistance: How strong is the resistance to different jailbreaking techniques?
Running the ASQI Chatbot Evaluation
The complete evaluation combines multiple test containers and provides comprehensive scoring:
# Download the comprehensive chatbot test suite
curl -O https://raw.githubusercontent.com/asqi-engineer/asqi-engineer/main/config/suites/asqi_chatbot_test_suite.yaml
curl -O https://raw.githubusercontent.com/asqi-engineer/asqi-engineer/main/config/score_cards/asqi_chatbot_score_card.yaml
# Run comprehensive chatbot evaluation (tests multiple containers)
asqi execute \
  -t asqi_chatbot_test_suite.yaml \
  -s demo_systems.yaml \
  -r asqi_chatbot_score_card.yaml \
  -o chatbot_asqi_assessment.json
Beta Status and Collaboration
We are actively seeking collaboration from the community to:
Refine assessment criteria: Help define industry-standard thresholds and grading scales
Expand test coverage: Contribute new test scenarios and edge cases
Develop domain-specific indices: Create specialized quality indices for different chatbot use cases
Validate against real deployments: Share feedback from production chatbot evaluations
We welcome feedback through GitHub Issues: open an issue to discuss collaboration opportunities or to share your experience with the beta ASQI Chatbot Quality Index.