The whiz.coach question pipeline turns a learner's request (a topic, a syllabus, a difficulty target) into a verified bank of examination-grade practice questions, each illustrated where appropriate with a publication-quality diagram. Behind that single click sits a five-agent cloud-resident system that decomposes question authoring, diagram drawing, multimodal validation, and pedagogical improvement into separately scalable, separately retryable stages. This paper describes the architecture, the agents, the Google Cloud primitives they share, and the engineering techniques that keep the pipeline correct, bounded, and educationally honest at production scale.
Generating exam-quality questions is a deceptively hard problem. A single request expands into dozens of artefacts that have to satisfy several independent quality bars simultaneously:
A single monolithic prompt fails on all of these: it overruns output budgets, couples unrelated failure modes, and produces output whose quality is impossible to verify after the fact. The system therefore decomposes generation into specialised agents and validation into specialised passes.
The pipeline runs on Google Cloud. Five specialised agents collaborate through Cloud Tasks queues; Firestore is the source of truth for both request state and per-question status; Cloud Storage hosts the rendered PNG diagrams; and a multi-model AI service routes generation calls between Gemini (with thinking enabled) and a GPT-class model that is the primary path for SVG authoring, with Gemini as its fallback.
Each agent extends a common base class, picks up a Cloud Tasks message, transitions per-question or per-request state in Firestore, and either chains to the next agent or returns. Five agents are in production for the question pipeline.
| Agent | Queue | Thinking | Responsibility |
|---|---|---|---|
| QuestionOrchestrator | orchestration | low | Reads request documents, validates subscription quotas, applies hourly rate limits, and enqueues a single fire-and-forget generation task. Returns immediately so the trigger does not hold the function timeout. |
| QuestionGenerationAgent | question-generation | medium | Fetches structured syllabus data (agent briefs, key terms from flashcards, exam patterns, worked-example style), generates 10 to 100 questions scaled by total study time, applies a Bloom's taxonomy distribution, and tags every question with a difficulty 1 to 5 (60%+ at 4 or 5). |
| DiagramAgent | diagram-generation | high | Authors SVG with a GPT-class model (primary) or Gemini (fallback) for strong math and visual reasoning; renders to PNG at 800×600; uploads to Cloud Storage; enqueues validation. Also regenerates PNG from corrected SVG in the iterative correction flow. |
| ValidationAgent | validation | medium | Multimodal pass with text plus the rendered PNG plus the SVG as reference. Checks factual correctness, answer accuracy, explanation clarity, and (critically) whether the diagram gives away the answer. Returns a corrected SVG when it detects any visual or geometric issue. |
| ImprovementAgent | improvement | medium | Receives validation feedback and rewrites the question, the explanation, or the MCQ choices. Bounded to 5 iterations; failure beyond that moves the question to the refuted-questions collection with a tracked reason. |
A sixth component, the cleanup sweeper, is not an agent but a Cloud Scheduler job that runs every 30 minutes. It re-enqueues questions stuck in any intermediate validation status and refutes anything that has exceeded its retry budget.
A learner or admin writing to the requests collection is the single entry point. The Firestore trigger never blocks; it just enqueues an orchestration task and returns.
When a question requires a diagram, the diagram must be rendered before validation can run, because the multimodal validator needs the PNG to inspect. The trigger therefore inspects the requiresDiagram flag and decides which queue receives the question first: diagram generation when the flag is set, validation otherwise.
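The routing decision can be sketched as a single pure function. This is an illustrative sketch, not the production trigger: only the `requiresDiagram` flag and the queue names come from the text, while `NewQuestion` and `firstQueueFor` are hypothetical names.

```typescript
// Hypothetical shape of a freshly created question document.
interface NewQuestion {
  id: string;
  requiresDiagram: boolean;
}

type FirstQueue = "diagram-generation-queue" | "validation-queue";

// Diagram questions are rendered first so the multimodal validator has a PNG
// to inspect; every other question goes straight to validation.
function firstQueueFor(question: NewQuestion): FirstQueue {
  return question.requiresDiagram ? "diagram-generation-queue" : "validation-queue";
}
```

Keeping the decision in one function means the ordering rule lives in one place instead of being repeated across triggers.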
The defining design choice of the diagram pipeline is the delayed counter increment. When the validator detects a visual issue and proposes a corrected SVG, the system does not count that attempt yet. The counter increments only after the corrected SVG has successfully rendered to a PNG. This prevents the system from burning correction attempts on malformed SVG that could never have produced a verifiable image in the first place.
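A minimal sketch of the delayed increment, assuming a hypothetical `DiagramState` record and a `render` callback that stands in for the real SVG-to-PNG renderer (neither name appears in the production code described here):

```typescript
// Sketch only: field and function names are hypothetical.
interface DiagramState {
  svg: string;
  correctionCount: number; // corrections that actually produced a PNG
}

type RenderResult =
  | { ok: true; png: Uint8Array }
  | { ok: false; error: string };

// `render` stands in for the real SVG-to-PNG renderer.
function applyCorrection(
  state: DiagramState,
  correctedSvg: string,
  render: (svg: string) => RenderResult,
): DiagramState {
  const result = render(correctedSvg);
  if (!result.ok) {
    // Malformed SVG never became an image, so no attempt is spent.
    return state;
  }
  // The counter moves only once a PNG exists for the validator to inspect.
  return { svg: correctedSvg, correctionCount: state.correctionCount + 1 };
}
```

The point of the shape is that the failure branch returns the state untouched: a proposal that cannot render leaves both the diagram and the budget exactly as they were.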
Angle arcs are the most consistently failure-prone primitive in educational SVG. The arc has to curve into the shape, not away from it, which depends on the SVG path's sweep-flag. Models routinely get this wrong. The system uses a three-tier fallback that escalates from "fix the arc" to "drop the arc but keep the label" to "give up on this question":
| Tier | Attempts | Strategy |
|---|---|---|
| Tier 1 | 1 to 3 | Ask the validator to fix the arc by flipping the sweep-flag and re-positioning the arc. |
| Tier 2 | 4 to 5 | Remove the arc path entirely but keep the angle text label (for example "60°"). A diagram with labels but no arcs is still educationally valid. |
| Tier 3 | past 5 | Move the question to the refuted-questions collection with a tracked reason. No further work. |
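The tier table maps directly onto a small selection function. The sketch below assumes a 1-based attempt number; the strategy labels and the function name are illustrative, not taken from the codebase.

```typescript
type ArcStrategy = "fix-arc" | "drop-arc-keep-label" | "refute";

// `attempt` is the 1-based number of the arc correction about to be made.
function arcStrategyFor(attempt: number): ArcStrategy {
  if (attempt <= 3) return "fix-arc";             // Tier 1: flip the sweep-flag, re-position
  if (attempt <= 5) return "drop-arc-keep-label"; // Tier 2: keep only the angle text label
  return "refute";                                // Tier 3: move to refuted-questions
}
```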
When a learner submits negative feedback on a question, the feedback document creation triggers an orchestration task that fans out to validation with the feedback as additional context. If the re-validation surfaces issues, the question enters the improvement loop; if not, the feedback is logged and no change is made.
The pipeline composes a small set of Google Cloud primitives and a single external AI provider, each chosen for a specific operational property.
| Component | Role | How the pipeline uses it |
|---|---|---|
| Cloud Functions v2 | Agent runtime | A single HTTP entry point dispatches each Cloud Tasks message to the named agent. Per-function timeout is 30 minutes (1800 s) for the task processor; queue dispatch deadlines vary from 600 s default to 1800 s for long-running queues like question generation. |
| Cloud Tasks | Async messaging | Five queues — orchestration, question generation, diagram generation, validation, improvement — each with its own rate-limit and concurrency profile. Uniform retry budget of 2 retries (3 total attempts) with 60 s to 1800 s exponential backoff. Agents can request custom 20 to 240 minute delays for rate-limit retries via a re-queue mechanism. |
| Firestore | State + idempotency | Source of truth for requests, questions, quizzes, refuted-questions, and feedbacks collections. Triggers on document creation enqueue the next stage. Permanent flags (for example a validation-triggered marker) prevent re-triggering on subsequent updates. |
| Cloud Storage | Diagram host | Per-syllabus prefixes hold the rendered PNG diagrams at a stable path keyed by question id, so the validator can fetch the image without any cross-reference. |
| Vertex AI Gemini | Primary AI | Used for question generation, validation, and improvement at thinkingLevel: 'medium'; also for SVG fallback when the primary SVG model fails. Runs on the global endpoint; structured output is enforced through Zod schemas converted to Gemini's native JSON-schema format. |
| GPT-class SVG model | SVG authoring | Primary path for diagram generation because publication-quality SVG benefits from strong math-and-visual reasoning. Hosted via the OpenAI Responses API. Fallback path is Gemini, with the same Zod schema. |
| Cloud Scheduler | Maintenance | A 30-minute cleanup sweeper detects stuck questions in any intermediate status and either re-enqueues them or refutes them once retry budgets are exhausted. |
Each queue isolates one stage so a backlog in one cannot stall another. Queue-level concurrency caps and rates per second map directly onto provider quotas.
| Queue | Concurrency | Rate | Retries |
|---|---|---|---|
| orchestration-queue | 20 | 10 / sec | 2 |
| question-generation-queue | 5 | 2 / sec | 2 |
| diagram-generation-queue | 5 | 2 / sec | 2 |
| validation-queue | 8 | 3 / sec | 2 |
| improvement-queue | 5 | 2 / sec | 2 |
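To make the uniform retry budget concrete, the sketch below computes the delay before each retry under a simplified pure-doubling model. This is an assumption: the actual Cloud Tasks schedule also depends on the queue's max-doublings setting, which the text does not state, and `RetryPolicy` and `backoffSeconds` are hypothetical names.

```typescript
interface RetryPolicy {
  maxRetries: number;    // retries after the first attempt
  minBackoffSec: number;
  maxBackoffSec: number;
}

// Values taken from the queue description: 2 retries, 60 s to 1800 s.
const uniformPolicy: RetryPolicy = { maxRetries: 2, minBackoffSec: 60, maxBackoffSec: 1800 };

// Delay before retry number `retry` (1-based), or null once the budget is spent.
// Simplified model: the delay doubles on every retry and is capped at the maximum.
function backoffSeconds(policy: RetryPolicy, retry: number): number | null {
  if (retry < 1 || retry > policy.maxRetries) return null;
  const doubled = policy.minBackoffSec * Math.pow(2, retry - 1);
  return Math.min(doubled, policy.maxBackoffSec);
}
```

Under this model the two permitted retries wait 60 s and 120 s; the 1800 s ceiling would only matter for a queue with a larger retry budget.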
| Bound | Where | What it prevents |
|---|---|---|
| 5 improvement iterations | ImprovementAgent | Endless validate ↔ improve cycles on questions the validator and improver cannot agree on. |
| 5 diagram corrections (after PNG success) | ValidationAgent + DiagramAgent | Repeated cosmetic patches that never converge to a clean diagram. |
| Tier 1 / Tier 2 / Tier 3 angle-arc fallback | ValidationAgent | Models repeatedly misplacing angle arcs; falls back to "label-only" before refuting. |
| 2 Cloud Tasks retries (3 total) | Queue config | Transient provider failures masquerading as permanent ones. |
| 15 attempts hard cap | Base-agent attempt counter | Any path that escaped the agent-specific bounds. |
| 30-minute cleanup sweep | Cloud Scheduler | Questions stalled in an intermediate status because a task was dropped, throttled, or timed out. |
The orchestrator never waits for the question generator. Generation can take several minutes for a 100-item request and would otherwise dominate the orchestration trigger's function timeout. The orchestrator therefore enqueues the generation task, marks the request status as generating, and returns within seconds. The generation agent updates the request to completed or failed when it actually finishes.
The delayed counter increment is the single most important correctness mechanism in the diagram pipeline. If the counter incremented at the moment a correction was proposed, a malformed SVG would burn an attempt even though it could never have produced a valid PNG. By incrementing only after the renderer succeeds, the system guarantees that all five permitted attempts produce verifiable images, and the validator gets five honest chances to find a clean one.
Earlier versions of the validator carried hand-written checks for specific failure classes: overlap between text and shapes, missing angle indicators, off-canvas elements, and so on. Every new diagram type required a new check, and unfamiliar failures slipped through. The current validator removes all of that. It hands the rendered PNG to a multimodal model with a single comprehensive instruction set: examine the image, list anything that would confuse a learner or compromise educational integrity, and (if possible) emit a corrected SVG. The validator's job becomes general; the model's job becomes specific.
For diagram questions the validator receives three inputs: the question text plus markup, the rendered PNG as the primary visual source (base64-encoded), and the SVG source code as reference (not parsed). The model is instructed to base its visual judgement on the PNG, because that is what learners actually see. The SVG is provided only so the model can reason about and propose a corrected version.
Diagrams must never display values that the learner is supposed to calculate. The DiagramAgent's prompt explicitly forbids labelling angles in standard shapes (60° in equilateral triangles, 90° in rectangles) and forbids showing derived measurements. The validator carries the same rule as a second-line check: any question whose diagram gives away the answer fails validation, regardless of its other merits.
| Bloom level | Target % | Difficulty | Target % |
|---|---|---|---|
| Remember | 5% | 1 — Warm Up | 5% |
| Understand | 15% | 2 — Easy | 10% |
| Apply | 35% | 3 — Medium | 25% |
| Analyze | 25% | 4 — Challenging | 40% |
| Evaluate | 15% | 5 — Hardcore | 20% |
| Create | 5% | (60%+ at 4 or 5) | |
The Bloom level is a required field on every generated question. Temperature is held at 1.0 to encourage scenario diversity across requests, so re-running a request tends to produce fresh questions rather than duplicates.
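One way to turn the Bloom percentages into whole-question counts for a request is largest-remainder allocation. The text does not say how the production generator rounds, so this is an illustrative sketch under that assumption; `BloomTarget`, `Allocation`, and `allocate` are hypothetical names.

```typescript
interface BloomTarget {
  level: string;
  percent: number;
}

interface Allocation {
  level: string;
  count: number;
}

// Percentages from the distribution table.
const bloomTargets: BloomTarget[] = [
  { level: "Remember", percent: 5 },
  { level: "Understand", percent: 15 },
  { level: "Apply", percent: 35 },
  { level: "Analyze", percent: 25 },
  { level: "Evaluate", percent: 15 },
  { level: "Create", percent: 5 },
];

// Largest-remainder allocation: floor each share, then give the leftover
// questions to the levels with the biggest fractional parts.
function allocate(total: number, targets: BloomTarget[]): Allocation[] {
  const rows = targets.map((t) => {
    const exact = (total * t.percent) / 100;
    const count = Math.floor(exact);
    return { level: t.level, count, remainder: exact - count };
  });
  let leftover = total - rows.reduce((sum, r) => sum + r.count, 0);
  // The sorted copy shares its row objects with `rows`, so bumping a count
  // here is reflected in the final result.
  const byRemainder = rows.slice().sort((a, b) => b.remainder - a.remainder);
  for (const row of byRemainder) {
    if (leftover <= 0) break;
    row.count += 1;
    leftover -= 1;
  }
  return rows.map((r) => ({ level: r.level, count: r.count }));
}
```

The same function would work for the difficulty column, since it only needs a list of percentages that sum to 100.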
Every AI call uses Zod schemas to enforce structured output. Schemas are passed through a converter that translates them into the providers' native JSON-schema formats; runtime validation guarantees the agent receives a well-formed response or a clear error. Three representative schemas:
Geometric questions must have their diagram before validation runs, or the multimodal model has nothing to inspect. Three independent layers guard this:
When the validator proposes a corrected SVG, a quick structural check runs before the system commits a correction attempt: the SVG must have proper tags, a white background, no forbidden CSS or class attributes, a minimum length, drawing elements, and text labels. A correction that fails structural lint is rejected outright — it does not count against the five-attempt budget. This is distinct from visual analysis; structural lint is fast and deterministic, visual analysis is generic and adaptive.
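A rough sketch of such a lint pass, using regular expressions rather than an XML parser. The checks mirror the list above, but the concrete rules are assumptions: the minimum length, the treatment of `<style>` blocks and `class` attributes as the forbidden CSS, and the detection of the white background as a `<rect>` with a white fill are all illustrative choices.

```typescript
interface LintResult {
  ok: boolean;
  problems: string[];
}

const MIN_SVG_LENGTH = 100; // assumed threshold; the real value is not given in the text

// Fast, deterministic structural checks. Assumes the SVG has no XML prolog.
function lintSvg(svg: string): LintResult {
  const s = svg.trim();
  const problems: string[] = [];
  if (!/^<svg[\s>]/i.test(s) || !/<\/svg>$/i.test(s)) problems.push("missing <svg> root tags");
  if (s.length < MIN_SVG_LENGTH) problems.push("too short");
  if (!/<rect\b[^>]*fill="(white|#fff|#ffffff)"/i.test(s)) problems.push("no white background");
  if (/<style\b|\sclass\s*=/i.test(s)) problems.push("forbidden CSS or class attribute");
  if (!/<(path|line|circle|ellipse|polygon|polyline)\b/i.test(s)) problems.push("no drawing elements");
  if (!/<text\b/i.test(s)) problems.push("no text labels");
  return { ok: problems.length === 0, problems };
}
```

A caller would run `lintSvg` on the proposed correction first and only hand the SVG to the renderer, and eventually to the correction counter, when `ok` is true.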
When a question is moved to the refuted-questions collection, the same atomic write also removes its id from every quiz that referenced it. This is the only way to guarantee that a quiz fetch never returns a dangling pointer to a question that no longer exists.
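The invariant can be illustrated with an in-memory model in which the whole next state is computed in one step. In Firestore the equivalent would presumably be a batched write or transaction; that mapping is an assumption, and every type and function name below is hypothetical.

```typescript
interface QuestionDoc {
  id: string;
  text: string;
}

interface RefutedDoc extends QuestionDoc {
  reason: string;
}

interface QuizDoc {
  id: string;
  questionIds: string[];
}

interface StoreSnapshot {
  questions: QuestionDoc[];
  refuted: RefutedDoc[];
  quizzes: QuizDoc[];
}

// Builds the complete next state at once, so no reader can observe a quiz
// that still points at a question that has already been moved.
function refuteQuestion(state: StoreSnapshot, questionId: string, reason: string): StoreSnapshot {
  const target = state.questions.filter((q) => q.id === questionId)[0];
  if (target === undefined) return state; // unknown or already refuted: nothing to do
  return {
    questions: state.questions.filter((q) => q.id !== questionId),
    refuted: state.refuted.concat([{ id: target.id, text: target.text, reason }]),
    quizzes: state.quizzes.map((quiz) => ({
      id: quiz.id,
      questionIds: quiz.questionIds.filter((id) => id !== questionId),
    })),
  };
}
```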
Learner feedback is the system's strongest signal that something passed validation but failed in the field. Negative feedback is not treated as a directive to edit — it is treated as new validation context. The feedback processing flow re-runs validation with the feedback text included in the prompt; if the validator finds nothing wrong, the feedback is logged and no edit occurs; if the validator confirms an issue, the question enters the bounded improvement loop with a fresh budget.
Aggregate analysis of feedback patterns drives prompt evolution. Specific recurring complaints (long labels truncating in diagrams, angle arcs on the wrong side, MCQ choices that do not match the diagram) become rules in the generation and validation prompts, which collapses the same class of complaint to zero across the next generation cycle.
The cleanup sweeper, which runs every 30 minutes, handles every intermediate status: questions awaiting a diagram, questions whose validation is still processing, questions in waiting_for_diagram, and questions whose improvement budget is partially spent. It re-enqueues anything still within its retry budget and refutes anything past it. Quota-induced errors get a longer back-off (up to four hours) and up to five additional retries before final refutation.
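The sweeper's per-question decision might look like the sketch below. Several details are assumptions made for illustration: quota retries are tracked in their own counter, the quota delay doubles from 20 minutes up to the 240-minute ceiling, and the normal retry budget is passed in because the text does not give its value.

```typescript
interface StalledQuestion {
  attempts: number;           // attempts already made on this question
  quotaRetries: number;       // quota-induced retries already made
  lastErrorWasQuota: boolean;
}

interface SweepDecision {
  action: "re-enqueue" | "refute";
  delayMinutes: number;
}

const QUOTA_EXTRA_RETRIES = 5;   // "up to five additional retries"
const QUOTA_MIN_DELAY_MIN = 20;  // lower end of the custom re-queue delay range
const QUOTA_MAX_DELAY_MIN = 240; // four hours

function decideSweep(q: StalledQuestion, retryBudget: number): SweepDecision {
  if (q.lastErrorWasQuota) {
    if (q.quotaRetries >= QUOTA_EXTRA_RETRIES) return { action: "refute", delayMinutes: 0 };
    // Assumed schedule: 20, 40, 80, 160, then capped at 240 minutes.
    const delay = Math.min(QUOTA_MAX_DELAY_MIN, QUOTA_MIN_DELAY_MIN * Math.pow(2, q.quotaRetries));
    return { action: "re-enqueue", delayMinutes: delay };
  }
  if (q.attempts >= retryBudget) return { action: "refute", delayMinutes: 0 };
  return { action: "re-enqueue", delayMinutes: 0 };
}
```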
| Signal | Where to look |
|---|---|
| Per-agent activity | Centralised log stream, filtered by per-agent prefix (for example a QuestionGenerationAgent prefix). |
| Queue depth and latency | Cloud Tasks dashboard, per queue. |
| Validation pass/fail rate | Aggregated from the per-question verified field; admin dashboard surfaces a refute-rate chart. |
| Diagram correction histogram | Distribution of the per-question correction-count field; a spike at 5 indicates a quality regression in the SVG model or prompt. |
| Improvement iterations | Per-question improvement-count field; a spike at 5 indicates ambiguous or under-specified source material. |
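The two count-based signals in the table reduce to a histogram and a share-at-cap figure. The sketch below assumes the per-question counts have already been read into an array; the function names are hypothetical, and what share counts as a "spike" is left to the operator.

```typescript
// Returns bins where bins[i] is the number of questions whose count equals i,
// for i from 0 to maxCount. Counts outside that range are clamped into it.
function correctionHistogram(counts: number[], maxCount: number): number[] {
  const bins: number[] = [];
  for (let i = 0; i <= maxCount; i++) bins.push(0);
  for (const c of counts) {
    const bin = Math.min(Math.max(Math.floor(c), 0), maxCount);
    bins[bin] = (bins[bin] ?? 0) + 1;
  }
  return bins;
}

// Fraction of questions that reached the cap (for example 5 corrections).
function shareAtCap(counts: number[], cap: number): number {
  if (counts.length === 0) return 0;
  return counts.filter((c) => c >= cap).length / counts.length;
}
```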
The question and diagram pipeline demonstrates a workable production pattern for AI-generated assessment content. Decompose the work into specialised, bounded, idempotent agents on managed cloud infrastructure; isolate each stage in its own queue so failure modes do not cascade; let a multimodal model do the subjective visual judgement against a single comprehensive prompt rather than against hand-written checks; and bound every loop with counters and a cleanup sweeper that catches anything stalled outside the budget.
The two design choices that did the most work were generic visual analysis (which collapses dozens of hand-written diagram checks into one model call that adapts to new failure classes for free) and delayed counter increment (which guarantees the system's five correction attempts are five honest renderings, not five wasted proposals). Both replaced specificity with a small amount of well-placed convention.