whiz.coach Engineering · White Paper

Pedagogically Rigorous Question and Diagram Generation and Validation using Multi‑Agent Orchestration

How five specialised AI agents on Google Cloud generate examination-grade questions, illustrate them with educationally faithful diagrams, and validate both through generic AI visual analysis with multi-layer integrity checks.

Document version
v1.0 · 2025-12-27

Author
Rishit Awasthi

Audience
Engineering, Architecture, Educational Product

Scope
Question pipeline, diagram authoring and iterative correction

1. Summary

The whiz.coach question pipeline turns a learner's request (a topic, a syllabus, a difficulty target) into a verified bank of examination-grade practice questions, each illustrated where appropriate with a publication-quality diagram. Behind that single click sits a five-agent cloud-resident system that decomposes question authoring, diagram drawing, multimodal validation, and pedagogical improvement into separately scalable, separately retryable stages. This paper describes the architecture, the agents, the Google Cloud primitives they share, and the engineering techniques that keep the pipeline correct, bounded, and educationally honest at production scale.

Three load-bearing properties.

Educational integrity is enforced in code. Diagrams are forbidden from showing values learners are supposed to calculate; the validator carries the same rule and refuses any diagram that gives away the answer.
Validation is generic, not pattern-matched. The validator does not look for a list of known SVG bugs. It hands the rendered PNG to a multimodal model and asks "is this clear, accurate, and answer-safe?", catching novel issues without any code change.
Every loop is bounded. Five improvement iterations, five diagram corrections, two queue retries, and a hard cap of fifteen attempts limit the work, and a cleanup sweep that runs every thirty minutes recovers anything stalled outside those budgets.

2. The question problem

Generating exam-quality questions is a deceptively hard problem. A single request expands into dozens of artefacts that have to satisfy several independent quality bars simultaneously:

Pedagogical coverage. A balanced set spans Bloom's taxonomy from Remember through Create, with at least 60% of the items placed at the upper difficulty tiers (Challenging / Hardcore).
Style alignment. Questions must match the actual exam patterns used in the syllabus, drawing on worked examples, key terminology from flashcards, and exam-paper structure.
Factual correctness. The stated answer must be the correct answer, and the explanation must be defensible against a multimodal AI checker.
Diagram fidelity. Where a diagram is needed, it must be visually clean, geometrically accurate, and free of any value the learner is meant to derive.
Bounded cost and latency. The whole pipeline must complete inside the Cloud Tasks dispatch deadline and never blow through provider quota budgets.

A single monolithic prompt fails on all of these: it overruns output budgets, couples unrelated failure modes, and produces output whose quality is impossible to verify after the fact. The system therefore decomposes generation into specialised agents and validation into specialised passes.

3. System architecture

The pipeline's orchestration, state, and storage live entirely inside Google Cloud. Five specialised agents collaborate through Cloud Tasks queues; Firestore is the source of truth for both request state and per-question status; Cloud Storage hosts the rendered PNG diagrams; a multi-model AI service routes generation calls between Gemini (with thinking enabled) and a GPT-class model dedicated to SVG authoring, with Gemini as the fallback for SVG.

flowchart LR
  subgraph TRIG["Triggers"]
    direction TB
    APP[Web app / Admin console]
    REQ[(requests collection)]
    Q[(questions collection)]
    FB[(feedbacks collection)]
  end
  subgraph CT["Cloud Tasks queues"]
    direction TB
    Q1[orchestration-queue]
    Q2[question-generation-queue]
    Q3[diagram-generation-queue]
    Q4[validation-queue]
    Q5[improvement-queue]
  end
  subgraph AGENTS["Specialised agents (Cloud Functions v2)"]
    direction TB
    A1[QuestionOrchestrator]
    A2[QuestionGenerationAgent]
    A3[DiagramAgent]
    A4[ValidationAgent]
    A5[ImprovementAgent]
  end
  subgraph BACKING["Backing services"]
    direction TB
    GEM[Gemini 3.1 Pro<br/>thinking enabled]
    GPT[GPT-class SVG model<br/>fallback to Gemini]
    FS[(Firestore)]
    CS[(Cloud Storage)]
    RESVG[resvg PNG renderer]
  end
  APP -- writes --> REQ
  REQ -- onCreate trigger --> Q1
  Q -- onCreate trigger --> Q3
  Q -- onCreate trigger --> Q4
  FB -- onCreate trigger --> Q1
  Q1 --> A1
  Q2 --> A2
  Q3 --> A3
  Q4 --> A4
  Q5 --> A5
  A1 -- enqueue --> Q2
  A2 -- write --> Q
  A3 -- request svg --> GPT
  A3 -- fallback --> GEM
  A3 -- render --> RESVG
  A3 -- upload png --> CS
  A3 -- enqueue --> Q4
  A4 -- multimodal --> GEM
  A4 -- enqueue --> Q5
  A4 -- enqueue --> Q3
  A5 -- enqueue --> Q4
  classDef ag fill:#ffe9b3,stroke:#7c5800;
  classDef qx fill:#f8fafc,stroke:#475569;
  classDef bk fill:#d6f0d6,stroke:#265a26;
  class AGENTS,A1,A2,A3,A4,A5 ag;
  class CT,Q1,Q2,Q3,Q4,Q5 qx;
  class BACKING,GEM,GPT,FS,CS,RESVG bk;

Figure 1. Top-level topology. Firestore triggers feed Cloud Tasks queues; queues dispatch HTTP calls to a single Cloud Functions endpoint that routes to the named agent.

Legend: agents, queues, backing services.

4. The agent cast

Each agent extends a common base class, picks up a Cloud Tasks message, transitions per-question or per-request state in Firestore, and either chains to the next agent or returns. Five agents are in production for the question pipeline.

Agent	Queue	Thinking	Responsibility
QuestionOrchestrator	orchestration	low	Reads request documents, validates subscription quotas, applies hourly rate limits, and enqueues a single fire-and-forget generation task. Returns immediately so the trigger does not hold the function timeout.
QuestionGenerationAgent	question-generation	medium	Fetches structured syllabus data (agent briefs, key terms from flashcards, exam patterns, worked-example style), generates 10 to 100 questions scaled by total study time, applies a Bloom's taxonomy distribution, and tags every question with a difficulty 1 to 5 (60%+ at 4 or 5).
DiagramAgent	diagram-generation	high	Authors SVG with a GPT-class model (primary) or Gemini (fallback) for strong math and visual reasoning; renders to PNG at 800×600; uploads to Cloud Storage; enqueues validation. Also regenerates PNG from corrected SVG in the iterative correction flow.
ValidationAgent	validation	medium	Multimodal pass with text plus the rendered PNG plus the SVG as reference. Checks factual correctness, answer accuracy, explanation clarity, and (critically) whether the diagram gives away the answer. Returns a corrected SVG when it detects any visual or geometric issue.
ImprovementAgent	improvement	medium	Receives validation feedback and rewrites the question, the explanation, or the MCQ choices. Bounded to 5 iterations; failure beyond that moves the question to the refuted-questions collection with a tracked reason.

A sixth component, the cleanup sweeper, is not an agent but a Cloud Scheduler job that runs every 30 minutes. It re-enqueues questions stuck in any intermediate validation status and refutes anything that has exceeded its retry budget.

5. Multi-agent interactions

5.1 The generation flow

A learner or admin writing to the requests collection is the single entry point. The Firestore trigger never blocks; it just enqueues an orchestration task and returns.

flowchart TD
  U([Admin or learner submits request]) --> R[(requests document created)]
  R -- onCreate trigger --> ORCH[QuestionOrchestrator]
  ORCH -- validates quota and rate limits --> CHECK{within budget?}
  CHECK -- no --> FAIL([request marked failed])
  CHECK -- yes --> ENQ[enqueue QuestionGenerationAgent task]
  ENQ -- fire-and-forget --> GEN[QuestionGenerationAgent]
  GEN -- fetch agent briefs, key terms, exam patterns --> CTX[structured syllabus context]
  CTX -- Gemini call with Zod schema --> Q[questions written to Firestore]
  Q -- onCreate triggers per question --> NEXT([Per-question validation path])

Figure 2. From request to per-question validation. The orchestrator never waits for generation.

5.2 The diagram + validation sequential flow

When a question requires a diagram, the system has to render the diagram before validation can run, because the multimodal validator needs the PNG to inspect. The trigger inspects the requiresDiagram flag and chooses which queue to enqueue first.

sequenceDiagram
  participant FS as Firestore trigger
  participant DA as DiagramAgent
  participant CS as Cloud Storage
  participant VA as ValidationAgent
  participant GPT as GPT-class SVG model
  participant GEM as Gemini
  FS->>FS: question onCreate
  FS->>FS: requiresDiagram?
  alt requires diagram
    FS->>DA: enqueue generate-diagram
    DA->>GPT: SVG codegen with viewBox 800x600
    GPT-->>DA: SVG with short labels only
    DA->>DA: render with resvg
    DA->>CS: upload PNG
    DA->>VA: enqueue validate-question
  else no diagram
    FS->>VA: enqueue validate-question directly
  end
  VA->>VA: read question and PNG
  VA->>GEM: text plus PNG plus SVG reference
  GEM-->>VA: verdict (verified, issues, correctedDiagramSvg?)
  alt verified true and diagram answer-safe
    VA->>VA: mark question verified
  else issues found
    VA->>VA: route to correction or improvement
  end

Figure 3. Sequential diagram-then-validation. The validator's input is always a fully rendered PNG, never an unrendered SVG.

5.3 The iterative diagram correction loop

The defining design choice of the diagram pipeline is the delayed counter increment. When the validator detects a visual issue and proposes a corrected SVG, the system does not count that attempt yet. The counter increments only after the corrected SVG has successfully rendered to a PNG. This prevents the system from burning correction attempts on malformed SVG that could never have produced a verifiable image in the first place.

flowchart TD
  V[ValidationAgent examines PNG] --> ANY{issues detected?}
  ANY -- no --> OK([mark verified])
  ANY -- yes --> COR{AI provided corrected SVG?}
  COR -- no --> REF([move to refuted-questions])
  COR -- yes --> CNT{correction count < 5?}
  CNT -- no --> REF
  CNT -- yes --> STR{valid SVG structure?}
  STR -- no --> REF
  STR -- yes --> SAVE[save corrected SVG<br/>counter NOT incremented]
  SAVE --> RPNG[DiagramAgent: render PNG]
  RPNG --> CONV{PNG render success?}
  CONV -- no --> RETRY[Cloud Tasks retries]
  RETRY --> RPNG
  CONV -- yes --> INC[increment correction count<br/>only now]
  INC --> CHECK{count == 5?}
  CHECK -- no --> REV[trigger re-validation]
  CHECK -- yes --> FINAL[trigger FINAL re-validation]
  REV --> V
  FINAL --> V
  classDef ok fill:#d6f0d6,stroke:#265a26;
  classDef bad fill:#f8d3c2,stroke:#8a3315;
  classDef step fill:#f8fafc,stroke:#475569;
  class OK ok;
  class REF bad;
  class SAVE,RPNG,INC,REV,FINAL step;

Figure 4. Iterative diagram correction with delayed counter increment. Counter increments only after a PNG successfully renders.
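The decision nodes of Figure 4 can be sketched as two small state-transition functions. This is a minimal illustration, not the production code: the type names, function names, and the in-memory `DiagramState` are hypothetical, and only the rules themselves (cap of five, refute when no corrected SVG is offered, increment only after a successful render) come from the text above.

```typescript
// Hypothetical sketch of the Figure 4 flow; all names are illustrative.
type ValidationOutcome = "verified" | "refuted" | "render-correction";
type RenderOutcome = "retry-render" | "revalidate" | "final-revalidate";

interface ValidationResult {
  issuesDetected: boolean;
  correctedSvg: string | null; // null when the model offered no fix
}

interface DiagramState {
  correctionCount: number; // counts only corrections that rendered to PNG
  pendingSvg: string | null; // corrected SVG waiting to be rendered
}

const MAX_CORRECTIONS = 5;

// Runs when a validation result arrives. Note that saving the corrected
// SVG does NOT touch correctionCount.
function onValidation(
  state: DiagramState,
  result: ValidationResult,
  isStructurallyValid: (svg: string) => boolean,
): ValidationOutcome {
  if (!result.issuesDetected) return "verified";
  if (result.correctedSvg === null) return "refuted";
  if (state.correctionCount >= MAX_CORRECTIONS) return "refuted";
  if (!isStructurallyValid(result.correctedSvg)) return "refuted";
  state.pendingSvg = result.correctedSvg;
  return "render-correction";
}

// Runs when the renderer reports back. The counter moves only on success,
// so a failed render leaves the budget untouched and is simply retried.
function onRenderResult(state: DiagramState, success: boolean): RenderOutcome {
  if (!success) return "retry-render";
  state.correctionCount += 1;
  state.pendingSvg = null;
  return state.correctionCount === MAX_CORRECTIONS ? "final-revalidate" : "revalidate";
}
```

Keeping the increment in `onRenderResult` rather than in `onValidation` is the whole technique: the budget is charged by the renderer's success path, never by the proposal path.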

5.4 The progressive angle-arc fallback

Angle arcs are the most consistently failure-prone primitive in educational SVG. The arc has to curve into the shape, not away from it, which depends on the SVG path's sweep-flag. Models routinely get this wrong. The system uses a three-tier fallback that escalates from "fix the arc" to "drop the arc but keep the label" to "give up on this question":

Tier	Attempts	Strategy
Tier 1	1 to 3	Ask the validator to fix the arc by flipping the sweep-flag and re-positioning the arc.
Tier 2	4 to 5	Remove the arc path entirely but keep the angle text label (for example "60°"). A diagram with labels but no arcs is still educationally valid.
Tier 3	past 5	Move the question to the refuted-questions collection with a tracked reason. No further work.
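As a rough sketch of the table above, the tier can be chosen from the attempt number, and the Tier 1 fix can be illustrated by toggling the sweep-flag in a path's `d` attribute. The function names are hypothetical. The only outside fact relied on is the SVG elliptical-arc parameter order (`A rx ry x-axis-rotation large-arc-flag sweep-flag x y`), and the regular expression is a simplification that assumes one arc per `A` command with whitespace or comma separators.

```typescript
// Hypothetical helper names; tier boundaries follow the table above.
type ArcStrategy = "fix-arc" | "label-only" | "refute";

// attempt is 1-based: the number of the correction about to be made.
function arcStrategyFor(attempt: number): ArcStrategy {
  if (attempt <= 3) return "fix-arc"; // Tier 1
  if (attempt <= 5) return "label-only"; // Tier 2
  return "refute"; // Tier 3
}

// Toggle the sweep-flag of every arc command in a path "d" string.
// Simplified: handles "A"/"a" followed by exactly the five leading
// parameters rx, ry, rotation, large-arc-flag, sweep-flag.
function flipSweepFlag(d: string): string {
  const num = "-?\\d*\\.?\\d+";
  const sep = "[\\s,]+";
  const arc = new RegExp(
    "([Aa])\\s*(" + num + ")" + sep + "(" + num + ")" + sep + "(" + num + ")" +
      sep + "([01])" + sep + "([01])",
    "g",
  );
  return d.replace(arc, (_match, cmd, rx, ry, rot, large, sweep) =>
    `${cmd} ${rx} ${ry} ${rot} ${large} ${sweep === "1" ? "0" : "1"}`);
}
```

For example, `flipSweepFlag("M 100 100 A 30 30 0 0 1 130 100")` returns `"M 100 100 A 30 30 0 0 0 130 100"`, which draws the same arc on the opposite side of the chord.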

5.5 The feedback processing flow

When a learner submits negative feedback on a question, the feedback document creation triggers an orchestration task that fans out to validation with the feedback as additional context. If the re-validation surfaces issues, the question enters the improvement loop; if not, the feedback is logged and no change is made.

sequenceDiagram participant L as Learner participant FS as Firestore trigger participant ORCH as QuestionOrchestrator participant VA as ValidationAgent participant IA as ImprovementAgent L->>FS: feedback document created FS->>ORCH: enqueue feedback processing ORCH->>VA: enqueue re-validate with feedback context VA->>VA: multimodal validation alt issues confirmed VA->>IA: enqueue improve-question IA->>IA: rewrite question, choices, or explanation IA->>VA: enqueue re-validate else feedback unfounded VA->>VA: log feedback, no change end

Figure 5. Learner-feedback loop. Each negative comment is treated as new validation context, not as a direct edit instruction.

6. Platform and external services

The pipeline composes a small set of Google Cloud primitives and a single external AI provider, each chosen for a specific operational property.

Component	Role	How the pipeline uses it
Cloud Functions v2	Agent runtime	A single HTTP entry point dispatches each Cloud Tasks message to the named agent. Per-function timeout is 30 minutes (1800 s) for the task processor; queue dispatch deadlines vary from 600 s default to 1800 s for long-running queues like question generation.
Cloud Tasks	Async messaging	Five queues — orchestration, question generation, diagram generation, validation, improvement — each with its own rate-limit and concurrency profile. Uniform retry budget of 2 retries (3 total attempts) with 60 s to 1800 s exponential backoff. Agents can request custom 20 to 240 minute delays for rate-limit retries via a re-queue mechanism.
Firestore	State + idempotency	Source of truth for requests, questions, quizzes, refuted-questions, and feedbacks collections. Triggers on document creation enqueue the next stage. Permanent flags (for example a validation-triggered marker) prevent re-triggering on subsequent updates.
Cloud Storage	Diagram host	Per-syllabus prefixes hold the rendered PNG diagrams at a stable path keyed by question id, so the validator can fetch the image without any cross-reference.
Vertex AI Gemini	Primary AI	Used for question generation, validation, and improvement at `thinkingLevel: 'medium'`; also for SVG fallback when the primary SVG model fails. Runs on the global endpoint; structured output is enforced through Zod schemas converted to Gemini's native JSON-schema format.
GPT-class SVG model	SVG authoring	Primary path for diagram generation because publication-quality SVG benefits from strong math-and-visual reasoning. Hosted via the OpenAI Responses API. Fallback path is Gemini, with the same Zod schema.
Cloud Scheduler	Maintenance	A 30-minute cleanup sweeper detects stuck questions in any intermediate status and either re-enqueues them or refutes them once retry budgets are exhausted.

6.1 Queue topology

Each queue isolates one stage so a backlog in one cannot stall another. Queue-level concurrency caps and rates per second map directly onto provider quotas.

Queue	Concurrency	Rate	Retries
orchestration-queue	20	10 / sec	2
question-generation-queue	5	2 / sec	2
diagram-generation-queue	5	2 / sec	2
validation-queue	8	3 / sec	2
improvement-queue	5	2 / sec	2
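The table can be expressed as a typed configuration object. This is a sketch only: the interface and field names are hypothetical, the numbers are copied from the table, and the doubling schedule in `backoffSeconds` is an assumption, since the text gives only the 60 s to 1800 s range for exponential backoff.

```typescript
// Hypothetical config shape; values are taken from the queue table.
type QueueName =
  | "orchestration-queue"
  | "question-generation-queue"
  | "diagram-generation-queue"
  | "validation-queue"
  | "improvement-queue";

interface QueueConfig {
  maxConcurrent: number;
  ratePerSecond: number;
  maxRetries: number;
}

const QUEUES: Record<QueueName, QueueConfig> = {
  "orchestration-queue": { maxConcurrent: 20, ratePerSecond: 10, maxRetries: 2 },
  "question-generation-queue": { maxConcurrent: 5, ratePerSecond: 2, maxRetries: 2 },
  "diagram-generation-queue": { maxConcurrent: 5, ratePerSecond: 2, maxRetries: 2 },
  "validation-queue": { maxConcurrent: 8, ratePerSecond: 3, maxRetries: 2 },
  "improvement-queue": { maxConcurrent: 5, ratePerSecond: 2, maxRetries: 2 },
};

const MIN_BACKOFF_S = 60;
const MAX_BACKOFF_S = 1800;

// Two retries means three total attempts.
function totalAttempts(config: QueueConfig): number {
  return config.maxRetries + 1;
}

// Assumed schedule: double from the minimum on each retry, capped at the
// maximum. retry is 1-based.
function backoffSeconds(retry: number): number {
  return Math.min(MAX_BACKOFF_S, MIN_BACKOFF_S * Math.pow(2, retry - 1));
}
```

With two retries per queue, only the first two steps of that schedule (60 s and 120 s under this assumption) would ever be used; the 1800 s ceiling matters only if the retry budget is raised.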

7. Engineering techniques

7.1 Loop prevention in three layers

Permanent trigger flags. Each Firestore trigger sets a flag (for example a validation-triggered marker) inside an atomic transaction. The flag is never reset, so the trigger acts only once per document, regardless of how many subsequent updates land.
Status checks at every agent entry point. Every action begins with an early return if the question is already verified, refuted, or in a status the agent does not own.
Task-chain tracking. Parent-task ids propagate through enqueues; chains deeper than ten hops are rejected before they can form a cycle.
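The three layers can be sketched against in-memory objects. The names and the status values here are hypothetical, and the flag claim stands in for what the text describes as an atomic Firestore transaction; the sketch shows the logic only, not the storage calls.

```typescript
// Hypothetical shapes; status names are placeholders for illustration.
type Status = "pending" | "validating" | "improving" | "verified" | "refuted";

interface QuestionDoc {
  status: Status;
  validationTriggered: boolean; // permanent flag, never reset
}

interface Task {
  id: string;
  questionId: string;
  parentIds: string[]; // ancestor task ids, oldest first
}

const MAX_CHAIN_DEPTH = 10;

// Layer 1: claim the permanent flag. Returns true only for the first caller.
// In production this read-then-write would run inside a transaction.
function claimTrigger(doc: QuestionDoc): boolean {
  if (doc.validationTriggered) return false;
  doc.validationTriggered = true;
  return true;
}

// Layer 2: early-return guard at every agent entry point.
function shouldProcess(doc: QuestionDoc, ownedStatuses: Status[]): boolean {
  if (doc.status === "verified" || doc.status === "refuted") return false;
  return ownedStatuses.indexOf(doc.status) !== -1;
}

// Layer 3: propagate parent ids and reject chains deeper than ten hops.
function childTask(parent: Task, newId: string): Task {
  const parentIds = parent.parentIds.concat([parent.id]);
  if (parentIds.length > MAX_CHAIN_DEPTH) {
    throw new Error("task chain exceeds " + MAX_CHAIN_DEPTH + " hops");
  }
  return { id: newId, questionId: parent.questionId, parentIds };
}
```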

7.2 Bounded iteration

Bound	Where	What it prevents
5 improvement iterations	ImprovementAgent	Endless validate ↔ improve cycles on questions the validator and improver cannot agree on.
5 diagram corrections (after PNG success)	ValidationAgent + DiagramAgent	Repeated cosmetic patches that never converge to a clean diagram.
Tier 1 / Tier 2 / Tier 3 angle-arc fallback	ValidationAgent	Models repeatedly misplacing angle arcs; falls back to "label-only" before refuting.
2 Cloud Tasks retries (3 total)	Queue config	Transient provider failures being treated as permanent.
15 attempts hard cap	Base-agent attempt counter	Any path that escaped the agent-specific bounds.
30-minute cleanup sweep	Cloud Scheduler	Questions stalled in an intermediate status because a task was dropped, throttled, or timed out.

7.3 Fire-and-forget for long operations

The orchestrator never waits for the question generator. Generation can take several minutes for a 100-item request and would otherwise dominate the orchestration trigger's function timeout. The orchestrator therefore enqueues the generation task, marks the request status as generating, and returns within seconds. The generation agent updates the request to completed or failed when it actually finishes.
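A minimal sketch of the fire-and-forget hand-off follows. The names are hypothetical, and the `enqueue` callback is an injected stand-in for the real (asynchronous) Cloud Tasks call, so the example stays self-contained and synchronous.

```typescript
// Hypothetical shapes for illustration.
type RequestStatus = "pending" | "generating" | "completed" | "failed";

interface GenerationRequest {
  id: string;
  status: RequestStatus;
}

interface GenerationTask {
  agent: "QuestionGenerationAgent";
  requestId: string;
}

// The orchestrator validates, enqueues, marks the request as generating,
// and returns. It never calls the generator or awaits its result.
function orchestrate(
  request: GenerationRequest,
  withinBudget: boolean,
  enqueue: (task: GenerationTask) => void,
): RequestStatus {
  if (!withinBudget) {
    request.status = "failed";
    return request.status;
  }
  enqueue({ agent: "QuestionGenerationAgent", requestId: request.id });
  request.status = "generating";
  return request.status;
}
```

The generation agent, not the orchestrator, is then responsible for the later transition to `completed` or `failed`.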

7.4 Delayed counter increment

Delayed counter increment is the single most important correctness mechanism in the diagram pipeline. If the counter incremented at the moment a correction was proposed, a malformed SVG would burn an attempt even though it could never have produced a valid PNG. By incrementing only after the renderer succeeds, the system guarantees that all five permitted attempts produce verifiable images, and the validator gets five honest chances to find a clean one.

7.5 Generic visual analysis

Earlier versions of the validator carried hand-written checks for specific failure classes: overlap between text and shapes, missing angle indicators, off-canvas elements, and so on. Every new diagram type required a new check, and unfamiliar failures slipped through. The current validator removes all of that. It hands the rendered PNG to a multimodal model with a single comprehensive instruction set: examine the image, list anything that would confuse a learner or compromise educational integrity, and (if possible) emit a corrected SVG. The validator's job becomes general; the model's job becomes specific.

"Including but not limited to." The validator prompt explicitly invites the model to flag any visual issue, not just the examples in the prompt. This is what lets the system adapt to new diagram types without code changes.

7.6 Multimodal validation

For diagram questions the validator receives three inputs: the question text plus markup, the rendered PNG as the primary visual source (base64-encoded), and the SVG source code as reference (not parsed). The model is instructed to base its visual judgement on the PNG, because that is what learners actually see. The SVG is provided only so the model can reason about and propose a corrected version.
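The three inputs can be pictured as an ordered list of message parts. The `Part` shape below is a provider-neutral assumption made for illustration, not the request format of any particular SDK, and the wording of the labels is invented.

```typescript
// Hypothetical, provider-neutral part shape.
type Part =
  | { kind: "text"; text: string }
  | { kind: "image"; mimeType: "image/png"; base64: string };

// Order matters for readability of the prompt: question first, then the
// image the model must judge, then the SVG source as reference only.
function buildValidationParts(
  questionText: string,
  pngBase64: string,
  svgSource: string,
): Part[] {
  return [
    { kind: "text", text: "QUESTION:\n" + questionText },
    { kind: "image", mimeType: "image/png", base64: pngBase64 },
    {
      kind: "text",
      text:
        "SVG SOURCE (reference only; base visual judgement on the image above):\n" +
        svgSource,
    },
  ];
}
```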

7.7 Educational integrity in code

Diagrams must never display values that the learner is supposed to calculate. The DiagramAgent's prompt explicitly forbids labelling angles in standard shapes (60° in equilateral triangles, 90° in rectangles) and forbids showing derived measurements. The validator carries the same rule as a second-line check: any question whose diagram gives away the answer is refused verification, regardless of its other merits.

7.8 Bloom's taxonomy and difficulty distribution

Bloom level	Target %
Remember	5%
Understand	15%
Apply	35%
Analyze	25%
Evaluate	15%
Create	5%

Difficulty	Target %
1 (Warm Up)	5%
2 (Easy)	10%
3 (Medium)	25%
4 (Challenging)	40%
5 (Hardcore)	20%

At least 60% of questions sit at difficulty 4 or 5.

The Bloom level is a required field on every generated question. Temperature is held at 1.0 to encourage scenario diversity across requests, so re-running a request tends to produce fresh questions rather than duplicates.
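To show how percentage targets might turn into whole-question counts, the sketch below uses largest-remainder rounding. That rounding method is an assumption chosen for the example (the text does not say how counts are derived), and the names are hypothetical; the percentages are the targets stated in this section.

```typescript
// Target percentages from this section; each list sums to 100.
const BLOOM_TARGETS: Array<[string, number]> = [
  ["Remember", 5], ["Understand", 15], ["Apply", 35],
  ["Analyze", 25], ["Evaluate", 15], ["Create", 5],
];

const DIFFICULTY_TARGETS: Array<[string, number]> = [
  ["1", 5], ["2", 10], ["3", 25], ["4", 40], ["5", 20],
];

// Largest-remainder allocation: floor every share, then hand the leftover
// questions to the categories with the largest fractional parts.
function allocate(
  total: number,
  targets: Array<[string, number]>,
): Array<[string, number]> {
  const rows = targets.map(([name, pct]) => {
    const raw = (total * pct) / 100;
    return { name, count: Math.floor(raw), frac: raw - Math.floor(raw) };
  });
  let remaining = total - rows.reduce((sum, r) => sum + r.count, 0);
  // slice() copies the array but shares the row objects, so incrementing
  // through the sorted view updates the rows returned below.
  const byFraction = rows.slice().sort((a, b) => b.frac - a.frac);
  for (const row of byFraction) {
    if (remaining <= 0) break;
    row.count += 1;
    remaining -= 1;
  }
  return rows.map((r): [string, number] => [r.name, r.count]);
}
```

For a 100-question request the counts equal the percentages exactly; for smaller requests the counts always sum to the requested total, with each category within one question of its ideal share.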

8. Quality controls

8.1 Schema-validated structured output

Every AI call uses Zod schemas to enforce structured output. Schemas are passed through a converter that translates them into the providers' native JSON-schema formats; runtime validation guarantees the agent receives a well-formed response or a clear error. Three representative schemas:

Question array. Each item carries question text, MCQ choices (when applicable), the canonical answer, the explanation, the Bloom level, the difficulty 1 to 5, a flag for whether a diagram is required, and the diagram instructions.
Diagram SVG. A complete SVG string plus optional generation notes for downstream debugging.
Validation verdict. A verified boolean, an optional refutation reason, per-axis issue booleans (explanation, answer, diagram, question), and an optional corrected SVG when the validator detected and can fix a diagram issue.
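The validation verdict can be sketched as a type plus a runtime check. The source enforces this with Zod; to keep the example dependency-free, a hand-written guard is swapped in here, so this is a stand-in for the technique rather than the actual schema. `correctedDiagramSvg` follows the field name shown in Figure 3; the other field names are assumptions.

```typescript
// Assumed field names, apart from correctedDiagramSvg (see Figure 3).
interface ValidationVerdict {
  verified: boolean;
  refutationReason?: string | undefined;
  issues: { explanation: boolean; answer: boolean; diagram: boolean; question: boolean };
  correctedDiagramSvg?: string | undefined;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function readBool(obj: Record<string, unknown>, key: string): boolean {
  const v = obj[key];
  if (typeof v !== "boolean") throw new Error(key + " must be a boolean");
  return v;
}

function readOptionalString(obj: Record<string, unknown>, key: string): string | undefined {
  const v = obj[key];
  if (v === undefined) return undefined;
  if (typeof v !== "string") throw new Error(key + " must be a string when present");
  return v;
}

// Returns a well-formed verdict or throws a clear error, mirroring the
// "well-formed response or a clear error" guarantee described above.
function parseVerdict(raw: unknown): ValidationVerdict {
  if (!isRecord(raw)) throw new Error("verdict must be an object");
  const issues = raw["issues"];
  if (!isRecord(issues)) throw new Error("issues must be an object");
  return {
    verified: readBool(raw, "verified"),
    refutationReason: readOptionalString(raw, "refutationReason"),
    issues: {
      explanation: readBool(issues, "explanation"),
      answer: readBool(issues, "answer"),
      diagram: readBool(issues, "diagram"),
      question: readBool(issues, "question"),
    },
    correctedDiagramSvg: readOptionalString(raw, "correctedDiagramSvg"),
  };
}
```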

8.2 Multi-layer protection for missing diagrams

Geometric questions must have their diagram before validation runs, or the multimodal model has nothing to inspect. Three independent layers guard this:

Primary block. Validation refuses to start if a diagram is required but the PNG URL is missing.
Safety net. A post-validation check overrides any AI verdict if it discovers, after the fact, that the diagram was missing during the check.
Enhanced logging. Every validation logs comprehensive diagram status (present / missing, generation status, correction count) so missed cases are traceable.
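The three layers reduce to a pre-check, a post-check, and a log line. The sketch below assumes a hypothetical question shape with a `diagramPngUrl` field; the real field names and log format are not given in the text.

```typescript
// Hypothetical question shape for illustration.
interface QuestionForValidation {
  requiresDiagram: boolean;
  diagramPngUrl: string | null;
  correctionCount: number;
}

function diagramMissing(q: QuestionForValidation): boolean {
  return q.requiresDiagram && (q.diagramPngUrl === null || q.diagramPngUrl === "");
}

// Layer 1 (primary block): refuse to start validation without the PNG.
function canStartValidation(q: QuestionForValidation): boolean {
  return !diagramMissing(q);
}

// Layer 2 (safety net): whatever the model said, a question that needed a
// diagram and did not have one cannot end up verified.
function finalVerdict(q: QuestionForValidation, aiVerified: boolean): boolean {
  return aiVerified && !diagramMissing(q);
}

// Layer 3 (logging): one line that makes missed cases traceable.
function diagramStatusLog(q: QuestionForValidation): string {
  const presence = !q.requiresDiagram ? "not-required" : diagramMissing(q) ? "missing" : "present";
  return "diagram=" + presence + " corrections=" + q.correctionCount;
}
```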

8.3 SVG structural lint before correction

When the validator proposes a corrected SVG, a quick structural check runs before the system commits a correction attempt: the SVG must have proper tags, a white background, no forbidden CSS or class attributes, a minimum length, drawing elements, and text labels. A correction that fails structural lint is rejected outright — it does not count against the five-attempt budget. This is distinct from visual analysis; structural lint is fast and deterministic, visual analysis is generic and adaptive.
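A structural lint of this kind can be approximated with a handful of regular-expression checks. Everything specific in the sketch is an assumption: the minimum length, the way a white background is recognised (a `rect` filled white), and the reading of "forbidden CSS or class attributes" as `<style>` elements and `class=` attributes.

```typescript
interface LintResult {
  ok: boolean;
  problems: string[];
}

const MIN_SVG_LENGTH = 100; // assumed threshold

// Fast, deterministic checks only; no rendering and no visual judgement.
function lintSvg(svg: string): LintResult {
  const problems: string[] = [];
  const s = svg.trim();

  if (!/^<svg[\s>]/i.test(s) || !/<\/svg>\s*$/i.test(s)) {
    problems.push("missing <svg> opening or closing tag");
  }
  if (s.length < MIN_SVG_LENGTH) problems.push("shorter than minimum length");
  if (!/<rect\b[^>]*fill=["'](?:#fff(?:fff)?|white)["']/i.test(s)) {
    problems.push("no white background rect");
  }
  if (/<style\b|\sclass=/i.test(s)) problems.push("forbidden CSS or class attribute");

  // A lone rect is assumed to be the background, so it does not count
  // as a drawing element on its own.
  const rectCount = (s.match(/<rect\b/gi) || []).length;
  const hasShape = /<(?:path|line|circle|ellipse|polygon|polyline)\b/i.test(s);
  if (!hasShape && rectCount < 2) problems.push("no drawing elements");

  if (!/<text\b/i.test(s)) problems.push("no text labels");

  return { ok: problems.length === 0, problems };
}
```

Returning the list of problems rather than a bare boolean keeps the rejection reason available for logging.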

8.4 Refute cleanup

When a question is moved to the refuted-questions collection, the same atomic write also removes its id from every quiz that referenced it. This is the only way to guarantee that a quiz fetch never returns a dangling pointer to a question that no longer exists.
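One way to picture the atomic refute is as a single planned list of writes that the caller commits together. The sketch below only builds that plan in memory; the `WriteOp` shape is hypothetical, modelling "move" as a set plus a delete is an assumption, and the collection names follow those listed in section 6.

```typescript
// Hypothetical write-plan shape; committing the plan atomically is left
// to the caller (for example a batched write or transaction).
interface Quiz {
  id: string;
  questionIds: string[];
}

type WriteOp =
  | { op: "set"; path: string; reason: string }
  | { op: "delete"; path: string }
  | { op: "update"; path: string; questionIds: string[] };

function planRefute(questionId: string, reason: string, quizzes: Quiz[]): WriteOp[] {
  const ops: WriteOp[] = [
    { op: "set", path: "refuted-questions/" + questionId, reason },
    { op: "delete", path: "questions/" + questionId },
  ];
  for (const quiz of quizzes) {
    if (quiz.questionIds.indexOf(questionId) !== -1) {
      ops.push({
        op: "update",
        path: "quizzes/" + quiz.id,
        questionIds: quiz.questionIds.filter((id) => id !== questionId),
      });
    }
  }
  return ops;
}
```

Because the quiz updates travel in the same plan as the move, there is no window in which a quiz still points at a question that has already gone.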

9. Closing the loop

Learner feedback is the system's strongest signal that something passed validation but failed in the field. Negative feedback is not treated as a directive to edit — it is treated as new validation context. The feedback processing flow re-runs validation with the feedback text included in the prompt; if the validator finds nothing wrong, the feedback is logged and no edit occurs; if the validator confirms an issue, the question enters the bounded improvement loop with a fresh budget.

Aggregate analysis of feedback patterns drives prompt evolution. Specific recurring complaints (long labels truncating in diagrams, angle arcs on the wrong side, MCQ choices that do not match the diagram) become rules in the generation and validation prompts, with the aim of eliminating that class of complaint in the next generation cycle.

10. Operational characteristics

10.1 Recovery

The cleanup sweeper, which runs every 30 minutes, handles every intermediate status: questions awaiting a diagram, questions whose validation is still in progress, questions in waiting_for_diagram, and questions whose improvement budget is partially spent. It re-enqueues anything still within its retry budget and refutes anything past it. Quota-induced errors get a longer back-off (up to four hours) and up to five additional retries before final refutation.
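The sweeper's per-question decision might look like the sketch below. Several details are assumptions made for illustration: the staleness threshold (set here to one sweep interval), the use of the fifteen-attempt hard cap as the refutation limit, and the doubling schedule for quota delays, which is chosen only because it fits the stated 20 to 240 minute range and five extra retries.

```typescript
type SweepAction = "skip" | "re-enqueue" | "refute";

interface SweepCandidate {
  status: string;
  attempts: number;
  updatedAtMs: number; // last status change, epoch milliseconds
}

const HARD_CAP_ATTEMPTS = 15;
const STALE_AFTER_MS = 30 * 60 * 1000; // assumed: one sweep interval
const MAX_QUOTA_RETRIES = 5;

function sweepDecision(q: SweepCandidate, nowMs: number): SweepAction {
  if (q.status === "verified" || q.status === "refuted") return "skip";
  if (nowMs - q.updatedAtMs < STALE_AFTER_MS) return "skip"; // still in flight
  if (q.attempts >= HARD_CAP_ATTEMPTS) return "refute";
  return "re-enqueue";
}

// Assumed schedule for quota-induced errors: 20, 40, 80, 160, 240 minutes,
// then no further retries. quotaRetry is 1-based.
function quotaDelayMinutes(quotaRetry: number): number | null {
  if (quotaRetry > MAX_QUOTA_RETRIES) return null;
  return Math.min(240, 20 * Math.pow(2, quotaRetry - 1));
}
```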

10.2 Telemetry

Signal	Where to look
Per-agent activity	Centralised log stream, filtered by per-agent prefix (for example a QuestionGenerationAgent prefix).
Queue depth and latency	Cloud Tasks dashboard, per queue.
Validation pass/fail rate	Aggregated from the per-question verified field; admin dashboard surfaces a refute-rate chart.
Diagram correction histogram	Distribution of the per-question correction-count field; a spike at 5 indicates a quality regression in the SVG model or prompt.
Improvement iterations	Per-question improvement-count field; a spike at 5 indicates ambiguous or under-specified source material.

10.3 Deployment isolation

Per-agent function deploys. Each agent is a separately deployable Cloud Function; a regression in one stage can be rolled back without redeploying the rest.
Queue pause as emergency control. A misbehaving stage can be paused at the queue level (in-flight tasks drain; new tasks accumulate) without affecting any other stage.
Schema versioning. Zod schemas are versioned alongside the agent; an agent will never accept output from a model whose schema does not match the deployed version.

11. Conclusion

The question and diagram pipeline demonstrates a workable production pattern for AI-generated assessment content. Decompose the work into specialised, bounded, idempotent agents on managed cloud infrastructure; isolate each stage in its own queue so failure modes do not cascade; let a multimodal model do the subjective visual judgement against a single comprehensive prompt rather than against hand-written checks; and bound every loop with counters and a cleanup sweeper that catches anything stalled outside the budget.

The two design choices that did the most work were generic visual analysis (which collapses dozens of hand-written diagram checks into one model call that adapts to new failure classes for free) and delayed counter increment (which guarantees the system's five correction attempts are five honest renderings, not five wasted proposals). Both replaced specificity with a small amount of well-placed convention.

Topics covered in more depth in companion material.

Canonical implementation guide for the question, diagram, validation, and improvement agents.
Cloud Tasks agent system: queue topology, loop-prevention principles, recovery patterns.
Multi-model AI service: Gemini-with-thinking primary, GPT-class fallback, Zod schema enforcement.
Companion paper: Visually Engaging Study Material Generation and Validation using Multi‑Agent Orchestration.