Ask a question
This demo drafts only from retrieved sources. If evidence is weak, it refuses (Strict policy).
Run a query to see a grounded draft with citations. Use Strict to enforce refusals on weak evidence.
Retrieval trace (top-k sources, similarity, snippets)
Evaluation harness
A lightweight, deterministic evaluation set for screening. Includes positive queries and a negative control.
| Test | Query | Expected | Refusal | Retrieval@k | Citations | Top similarity | Notes |
|---|---|---|---|---|---|---|---|
| Run an evaluation to populate results. | |||||||
Interpretation: not a benchmark against other models; it demonstrates evaluation habits (retrieval quality proxies, refusal correctness, traceability).
Production mapping
This portfolio build is client-only by design. Below is how the same system maps to a production architecture and the operational controls that typically matter.
Reference architecture
Ingestion → Chunking → Embeddings → Vector store → RAG API → Model/router → Traces + Eval.
Observability & governance
Production RAG is an observability problem: capture retrieval traces, prompt/response metadata, latency and cost budgets, and enforce access controls for sensitive documents.
Safety controls
- Prompt injection defence: treat retrieved content as untrusted data; allow-list behaviours.
- Grounding policy: refuse or ask clarifying questions when evidence is weak.
- Output constraints: structured responses and citation requirements for high-stakes domains.
Operational checklist
- Latency budgets: cache hot queries; optimise top-k; stream responses when possible.
- Cost control: routing + prompt compression; fall back to smaller models.
- Evaluation: regression tests on labelled queries (precision@k, citation correctness).
- Incidents: trace export for debugging; rollback on prompt/policy changes.
ATS triggers surfaced here: OpenTelemetry, Vector Store, pgvector, RBAC, PII Redaction, Rate Limiting, Caching.