QLANKR Test
AI Agent Evaluation Platform · test.qlankr.com
What QLANKR Test does
QLANKR Test evaluates AI systems across multiple quality dimensions using independent AI judges. Users submit agent output (chat transcripts, RAG Q&A pairs, tool call traces, classification results, generated content) and receive a QI score from 0 to 100 with per-dimension breakdowns, identified strengths, and specific improvement recommendations. Results are presented as shareable report cards.
Who it is for
- Developers building AI agents, chatbots, or automated systems
- Teams evaluating RAG pipelines, tool-calling agents, or content generation
- Anyone who needs structured, repeatable AI quality assessment before shipping
How it works
- Select an assessment template (e.g., Support Agent, RAG Accuracy, Tool-Use Correctness)
- Submit your agent's output data
- Independent AI judges evaluate across multiple quality dimensions
- Receive a QI score (0-100) with per-dimension breakdowns and actionable recommendations
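The submission step above might look like the following sketch, assuming a JSON payload for a RAG Accuracy evaluation. All field names here are illustrative assumptions, not QLANKR's documented API schema:

```python
import json

# Hypothetical payload shape for submitting agent output to QLANKR Test.
# Template slug and field names are assumptions for illustration only.
submission = {
    "template": "rag-accuracy",  # chosen assessment template
    "data": [
        {
            "question": "What is the refund window?",
            "answer": "Refunds are accepted within 30 days.",
            "context": ["Our policy allows refunds within 30 days of purchase."],
        }
    ],
}

payload = json.dumps(submission)
print(payload)
```

Each item in `data` pairs the agent's answer with the retrieved context, so judges can score faithfulness against the supplied passages.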
Assessment types
Ten assessment templates are available:
- Support Agent - Accuracy, tone, completeness, escalation handling, safety
- RAG Accuracy - Faithfulness, relevancy, hallucination resistance, citation quality
- Tool-Use Correctness - Tool selection, parameter accuracy, sequencing, error handling
- Prompt Robustness - Jailbreak resistance, instruction following, safety, graceful refusal
- Content Generation - Factual accuracy, coherence, style, completeness, originality
- Multi-Agent Coordination - Delegation logic, coordination, conflict resolution
- Classification & Extraction - Label accuracy, extraction completeness, format compliance
- Production Readiness - Reliability, latency, error recovery, observability
- Code Generation - Functional correctness, code quality, security, documentation
- Readiness Checklist - Self-assessed operational readiness check
QI scoring
QI (QLANKR Intelligence) is a composite score from 0 to 100. It is the average of dimension scores, each independently evaluated by an AI judge. Pro users get 3-judge consensus scoring with agreement metrics. Scores map to bands:
- Strong (90-100) - Production-quality across all dimensions
- Moderate (70-89) - Good foundation with room for improvement
- Developing (40-69) - Partial coverage, some dimensions need work
- Early (0-39) - Foundational gaps across multiple dimensions
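The scoring rule above is simple enough to sketch directly: average the per-dimension scores, then map the composite to its band. The dimension names below are examples taken from the RAG Accuracy template:

```python
# QI as described: the plain average of 0-100 dimension scores,
# mapped to a band by the published thresholds.

def qi_score(dimension_scores: dict[str, float]) -> float:
    """Composite QI: the average of per-dimension scores."""
    return sum(dimension_scores.values()) / len(dimension_scores)

def qi_band(score: float) -> str:
    """Map a 0-100 QI score to its band."""
    if score >= 90:
        return "Strong"
    if score >= 70:
        return "Moderate"
    if score >= 40:
        return "Developing"
    return "Early"

scores = {
    "faithfulness": 92,
    "relevancy": 85,
    "hallucination_resistance": 78,
    "citation_quality": 81,
}
qi = qi_score(scores)
print(round(qi), qi_band(qi))  # 84 Moderate
```

Because the composite is an unweighted mean, one weak dimension can pull an otherwise strong run down a full band, which is why the per-dimension breakdown matters.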
Key differentiators
- Multi-dimensional scoring, not a single pass/fail
- Independent AI judges (Gemini, GPT, Claude) with agreement metrics
- Published rubrics with full transparency
- Shareable report cards with permanent verification URLs
- Programmatic API for CI/CD integration
- No model training on submitted data
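The 3-judge consensus with agreement metrics could work along these lines. How QLANKR actually combines judges and defines "agreement" is not specified here; this sketch assumes a per-dimension mean with the judges' standard deviation as the agreement signal:

```python
from statistics import mean, pstdev

# Assumed consensus scheme: average three judges per dimension and
# report their spread (population std dev) as a disagreement measure.

def consensus(judge_scores: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """For each dimension, report the mean score and the judges' spread."""
    dims = judge_scores[0].keys()
    return {
        d: {
            "score": mean(j[d] for j in judge_scores),
            "spread": pstdev(j[d] for j in judge_scores),
        }
        for d in dims
    }

judges = [
    {"accuracy": 90, "tone": 80},  # judge 1 (e.g. Gemini)
    {"accuracy": 86, "tone": 84},  # judge 2 (e.g. GPT)
    {"accuracy": 88, "tone": 82},  # judge 3 (e.g. Claude)
]
result = consensus(judges)
print(result["accuracy"])
```

A low spread means the judges largely agree on that dimension; a high spread flags a score worth inspecting by hand.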
Pricing
- Free - 3 AI-judged evaluations per day, single judge, last 5 reports saved
- Pro ($19/month or $15/month annual) - 25 evaluations/day, 3-judge consensus, unlimited reports, PDF export, custom rubrics, API access, webhooks
Contact
- General: hello@qlankr.com
- Support: support@qlankr.com
- Assessments & API: labs@qlankr.com
- Privacy: privacy@qlankr.com
QLANKR Test · Stockholm, Sweden · test.qlankr.com