System

RAG Knowledge Copilot

A client-only RAG demo that surfaces production signals: retrieval tracing, grounded drafting, guardrails, a lightweight evaluation harness, and a pragmatic production mapping. Designed to be reliable on Cloudflare Pages (no backend calls, no CORS failure modes).

RAG LLMOps Retrieval Tracing Guardrails Evaluation Harness

Ask a question

This demo drafts only from retrieved sources. If evidence is weak, it refuses (Strict policy).

Run a query to see a grounded draft with citations. Use Strict to enforce refusals on weak evidence.

Retrieval trace (top-k sources, similarity, snippets)

Evaluation harness

A lightweight, deterministic evaluation set for screening. Includes positive queries and a negative control.

overall score refusal correctness retrieval@k citation presence trace completeness
Test Query Expected Refusal Retrieval@k Citations Top similarity Notes
Run an evaluation to populate results.

Interpretation: not a benchmark against other models; it demonstrates evaluation habits (retrieval quality proxies, refusal correctness, traceability).

Production mapping

This portfolio build is client-only by design. Below is how the same system maps to a production architecture and the operational controls that typically matter.

Reference architecture

IngestionChunkingEmbeddingsVector storeRAG APIModel/routerTraces + Eval.

Vector Store pgvector Pinecone OpenSearch Redis Cache Rate Limiting

Observability & governance

Production RAG is an observability problem: capture retrieval traces, prompt/response metadata, latency and cost budgets, and enforce access controls for sensitive documents.

OpenTelemetry Tracing RBAC PII Redaction Audit Logs

Safety controls

  • Prompt injection defence: treat retrieved content as untrusted data; allow-list behaviours.
  • Grounding policy: refuse or ask clarifying questions when evidence is weak.
  • Output constraints: structured responses and citation requirements for high-stakes domains.
Guardrails Refusal Policies Groundedness

Operational checklist

  • Latency budgets: cache hot queries; optimise top-k; stream responses when possible.
  • Cost control: routing + prompt compression; fall back to smaller models.
  • Evaluation: regression tests on labelled queries (precision@k, citation correctness).
  • Incidents: trace export for debugging; rollback on prompt/policy changes.
Caching Routing Regression Tests Rollback

ATS triggers surfaced here: OpenTelemetry, Vector Store, pgvector, RBAC, PII Redaction, Rate Limiting, Caching.