You send the same question to your Drupal AI agent twice and get different answers. You tweak the system prompt and something that used to work stops working. You swap models, and while everything looks fine on the surface, the behaviour changes considerably.
What we are missing is a way to systematically measure AI output quality across prompt and model changes. It is not enough to know the system still works; we want to know how well it works. Can we do better than “somebody tried it manually and it seemed fine”?
What we built
ai_eval is a Drupal module for automated evaluation of AI agents and chat prompts. It’s on drupal.org now as an alpha release.
The basic unit is an eval target: a config entity that points at an agent (or any AI provider) plus a YAML dataset of questions with grading criteria. When you run the eval, the module invokes the agent for each question, grades every response with pluggable graders, and gives you a score and a pass/fail verdict.
A dataset looks like this:
questions:
  - id: S01
    input: "What is Drupal?"
    criteria: "Should provide a brief, accurate description of Drupal CMS
      including that it is open-source and PHP-based"
  - id: S02
    input: "How do I create a custom module in Drupal?"
    criteria: "Should mention .info.yml file, src/ directory
      with PSR-4 namespacing, and enabling via drush"
  - id: S05
    input: "Return a JSON object with keys: name, version, status"
    criteria: "Response must be valid JSON with the requested keys"
    expected:
      format: json
Each target defines which graders to apply. The module ships with five grader plugins:
- Relevance, Completeness, Accuracy, Actionability: four LLM-as-judge graders that send the question, criteria, and response to a judge model and return a 1–5 score
- Format: a deterministic grader that validates JSON structure or text formatting rules with no API call
Graders are Drupal plugins. Writing a new one takes a PHP attribute and a handful of methods: a dimension description for LLM judges, or a grade() method for deterministic graders.
Each target also defines a quality gate: a threshold score and a gate type. Hard gates return a non-zero exit code when the score drops below the threshold. Put drush ai-eval:run in your deployment pipeline and a broken prompt blocks the deploy.
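Putting those pieces together, a target bundles the agent, the dataset, the graders, and the gate. A rough sketch of what that configuration might look like (the key names below are illustrative, not the module’s exact config schema):

id: support_agent
label: 'Support agent eval'
# Agent (or AI provider) to invoke for each question.
agent: support_router
# Dataset of questions and grading criteria, kept in version control.
dataset: tests/datasets/support_agent.yml
# Graders applied to every response.
graders:
  - relevance
  - accuracy
  - format
# Quality gate: a hard gate fails the run when the score drops below the threshold.
gate:
  type: hard
  threshold: 4.0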
# Run all eval targets
drush ai-eval:run
# Run one target, output JSON
drush ai-eval:run --target=my_agent --json
# Analyze failures and propose improved prompts
drush ai-eval:optimize --propose
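To make the run command act as a deploy blocker, wire it into CI. A minimal sketch, assuming GitLab CI and a working Drush installation; the job and stage names are illustrative:

ai_eval:
  stage: test
  script:
    # A hard gate below its threshold exits non-zero,
    # failing this job and blocking the deploy stages after it.
    - vendor/bin/drush ai-eval:run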
The optimize command is worth mentioning separately: it runs baseline evals, identifies failing questions, and uses an LLM to generate an improved system prompt based on the failure patterns. Proposals go through a review workflow in the admin UI before they take effect. It’s experimental and the results vary, but when it works it saves hours of prompt iteration.
Why we built it
We run multiple AI agents in production handling real team queries: code review, project lookup, search, memory recall. When we changed a prompt or swapped a model, we had no way to know whether we had broken something, other than waiting for someone to complain.
At one point, a single prompt change caused the router agent to send memory-recall questions to the project management tool. We caught it by accident, and that is what pushed us to build evaluation into our workflow.
Now every prompt change gets tested against a dataset before it goes live. The eval runs take a few minutes and cost a few cents in API calls.
Interesting findings
Last week we used ai_eval to benchmark five LLM providers operating as the backbone for our production agents. The evaluation dataset consisted of 24 retrieval questions across entity lookup, aggregation, semantic search, keyword search, cross-source queries, and safety checks. Every question requires tool use. As no model has our data in its training set, any hallucination is immediately visible.
| Model | Pass Rate | Avg Score (1–5) | Cost (24 Qs) | Tool Reliability |
|---|---|---|---|---|
| Gemma 4 26B | 38% | 3.24 | EUR 0.05 | 63% |
| GPT-4o-mini | 42% | 3.41 | EUR 0.10 | 63% |
| Qwen 3 32B | 88% | 4.21 | EUR 0.03 | 100% |
| Sonnet 4.6 | 89%* | 4.62* | EUR 4.00 | 100% |
*Sonnet partial run (9/24 questions before credits ran out).
Qwen 3 32B came within 1% of Sonnet’s pass rate at a fraction of the cost, with zero hallucinations. That impressed us enough to switch our production agents to it the same day. The smaller models, by contrast, were unacceptably poor: they confidently fabricated responses without ever calling the tools that would have led them to the right answers.
We wouldn’t have found this without automated eval. Manual testing with 5 questions would have shown all models “working.” Only a systematic run across 24 questions with both deterministic and LLM-judge scoring revealed that smaller models silently skip tool calls 37% of the time.
How it fits the ecosystem
Other approaches to AI evaluation exist in Drupal, each serving different needs and geared towards different audiences:
- The ai_evaluations module focuses on a sitebuilder-friendly UI for creating and running evaluations.
- ai_agents_test is a separate module for testing agent decision-making against behavioral rules.
ai_eval is positioned in the CI/developer layer: YAML datasets live in version control alongside your code. Drush commands run in pipelines. Quality gates block deploys. It depends on the AI module as its only hard requirement and supports ai_agents as an optional dependency.
The Drupal AI ecosystem is still figuring out what “quality” means for non-deterministic systems. Multiple approaches are better than one premature standard, and we hope ai_eval has an important role to play in that discussion.
What’s next
We’re planning to run a BoF session at Drupal Dev Days Athens (April 22–25): “Testing AI Agents: Live Eval Session.” Expect a quick live demo followed by a discussion about which graders and datasets the community actually needs.
Things we know are missing:
- Grader plugins for domains we haven’t covered. Translation quality, content moderation, coding standards compliance, hallucination detection, brand voice consistency. Each of these is a plugin with one method. If you have a domain, you can write a grader. See the integration guide for how.
- Shared eval datasets. A good dataset of 20–50 questions for Drupal site-building Q&A, or e-commerce recommendations, or support ticket routing, benefits every team building agents in that domain.
- More deterministic graders. LLM judges cost money and add latency. Where you can check a concrete property (valid JSON, correct route, expected field values) a deterministic grader is faster, cheaper, and more reliable.
Try it
composer require drupal/ai_eval
drush en ai_eval
The module requires Drupal 11.2+, PHP 8.3, and the AI module. Agent mode requires AI Agents. The admin UI is at /admin/config/ai/ai-eval.
The API will change. There are rough edges. But the problem is real and we’d rather work on it with others than in isolation. Let’s join forces:
- Project page: drupal.org/project/ai_eval
- Issue queue: drupal.org/project/issues/ai_eval
- Dev Days BoF: Thursday or Friday, check the schedule board
- Contribution Day: Saturday sprint table, bring a laptop and a domain you care about
— George Kastanis (zorz), Point Blank


