PDF Knowledge Demo

RAG Copilot β€” Document Q&A with Citations

Upload a PDF document (max 5000 words), build a vector index, and chat with your document in natural language with full citation tracing. Run automated benchmarks to evaluate retrieval precision, answer relevance, and context coverage.

RAG · FAISS · Citations · Evaluation

1. Upload & Index PDF

Select a PDF file (max 5000 words). The system will extract text, chunk it with sliding windows, and build a FAISS vector index.

πŸ“„
Drop PDF here or click to browse
Maximum 5000 words

4. What This Project Does

This is a Retrieval-Augmented Generation (RAG) system that lets you upload a PDF document and ask questions about it in natural language. Instead of searching for keywords, it understands the meaning of your question and finds relevant sections of your document to answer it.

The system solves the problem of extracting information from long documents. Rather than reading an entire PDF to find an answer, you can ask specific questions and get precise responses with references to the exact sections where the information came from.

This project demonstrates three capabilities: semantic search (finding relevant information based on meaning, not just keywords), question answering (generating accurate responses from document content), and automated evaluation (measuring how well the system performs without manual review).

The system includes guardrails to ensure answers come only from the uploaded document, provides citations so you can verify every answer, and includes an automated benchmark suite to measure retrieval quality, answer relevance, and index coverage.

5. Technical Deep-Dive

Purpose and Scope

This system implements a production-grade RAG pipeline for single-document question answering with citation tracing and automated evaluation. The scope is constrained to PDF documents under 5000 words, single-user sessions, and in-memory storage to keep the demo lightweight and deployable on free-tier infrastructure.

Architecture and Design Decisions

The architecture separates concerns into three distinct pipelines: ingest (document processing and indexing), retrieval (semantic search and answer generation), and evaluation (automated quality metrics). This design enables independent testing and optimization of each component. The REST API keeps per-document state in a session store; the handlers themselves are stateless, which keeps the design simple and would permit horizontal scaling once session state is moved to shared storage.

Data Flow and Execution Model

Ingest Pipeline

PDF β†’ Text extraction (pdfplumber) β†’ Word count validation (5000 limit) β†’ Sliding-window chunking (200 words per chunk, 40-word overlap to preserve context across boundaries) β†’ Sentence embedding (sentence-transformers with all-MiniLM-L6-v2 model) β†’ FAISS IndexFlatIP (cosine similarity on L2-normalized embeddings) β†’ In-memory session store.
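In code, the embed-normalize-index steps reduce to a few lines. The sketch below swaps in a toy hash-based embedder for sentence-transformers and a plain numpy matrix for FAISS IndexFlatIP so it runs without heavy dependencies; the normalize-then-inner-product logic is the same:

```python
import hashlib
import numpy as np

def toy_embed(texts, dim=256):
    """Toy stand-in for SentenceTransformer.encode: hash each word into a
    fixed-size bag-of-words vector. Deterministic and dependency-free."""
    vecs = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        for word in text.lower().split():
            bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
            vecs[i, bucket] += 1.0
    return vecs

def build_index(chunks):
    """Embed chunks and L2-normalize the rows so that an inner product
    equals cosine similarity (mirroring faiss.normalize_L2 + IndexFlatIP)."""
    vecs = toy_embed(chunks)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.where(norms == 0, 1.0, norms)

chunks = ["faiss builds the vector index", "the chat endpoint answers questions"]
index = build_index(chunks)

query = build_index(["how is the vector index built"])[0]
scores = index @ query            # inner products on unit vectors == cosine
best = int(np.argmax(scores))     # position of the best-matching chunk
```

With real FAISS, `build_index` would instead call `faiss.normalize_L2(vecs)` and then `index = faiss.IndexFlatIP(dim); index.add(vecs)`.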

RAG Pipeline

User query β†’ Sentence embedding β†’ FAISS similarity search (top-4 chunks) β†’ Context injection into GPT-4o Mini prompt β†’ System prompt enforces answer-from-context-only constraint β†’ Response with citations (chunk IDs, text snippets, cosine scores, retrieval latency).
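The context-injection step can be sketched as a prompt builder. The system-prompt wording below is illustrative, not the project's exact prompt:

```python
def build_rag_messages(question, retrieved):
    """Assemble the chat payload: retrieved chunks become labeled context
    blocks, and the system prompt enforces the answer-from-context-only rule."""
    context = "\n\n".join(f"[chunk {cid}] {text}" for cid, text in retrieved)
    system = (
        "Answer ONLY from the context below. If the answer is not in the "
        "context, say you don't know. Cite the chunk IDs you used.\n\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_rag_messages(
    "What model embeds the chunks?",
    [(2, "Chunks are embedded with all-MiniLM-L6-v2."),
     (7, "FAISS stores the vectors.")],
)
# The messages would then be sent to the chat API, e.g.
# client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```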

Evaluation Pipeline

Generate synthetic test questions (extract first sentence from random chunks) β†’ Run RAG pipeline on each question β†’ Compute three reference-free metrics: retrieval precision (average top-1 cosine score), answer relevance (cosine similarity between question and answer embeddings), context coverage (fraction of unique chunks retrieved across all queries).
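The three metrics can be computed from per-query results with a small helper. This is an illustrative sketch; the field layout and names are assumptions, not the project's code:

```python
import numpy as np

def reference_free_metrics(results, total_chunks):
    """Each result is (top1_score, q_vec, a_vec, retrieved_chunk_ids),
    with q_vec and a_vec assumed unit-normalized embeddings."""
    # retrieval precision: average top-1 cosine score across queries
    precision = float(np.mean([r[0] for r in results]))
    # answer relevance: cosine between question and answer embeddings
    relevance = float(np.mean([float(np.dot(r[1], r[2])) for r in results]))
    # context coverage: fraction of unique chunks retrieved overall
    seen = set()
    for r in results:
        seen.update(r[3])
    coverage = len(seen) / total_chunks
    return {"retrieval_precision": precision,
            "answer_relevance": relevance,
            "context_coverage": coverage}

v = np.array([1.0, 0.0])
m = reference_free_metrics(
    [(0.8, v, v, [0, 1]), (0.6, v, v, [1, 2])], total_chunks=10)
```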

Implementation Details

Chunking Strategy

Sliding-window chunking with 200-word chunks and 40-word overlap reduces the chance that semantic units (paragraphs, concepts) are split across chunk boundaries, improving retrieval quality. Fixed-size chunks simplify embedding and keep retrieval latency consistent.
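A minimal sketch of the scheme with the 200/40 defaults described above (the helper name is illustrative):

```python
def chunk_words(text, size=200, overlap=40):
    """Sliding-window chunking: each chunk holds `size` words, and
    consecutive chunks share `overlap` words (stride = size - overlap)."""
    words = text.split()
    stride = size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already reached the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(doc)   # windows start at words 0, 160, 320
```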

Embedding Model Selection

all-MiniLM-L6-v2 balances quality and speed (384-dimensional embeddings, 14M parameters, ~120ms inference on 2-core CPU). Chosen for acceptable semantic similarity performance without requiring GPU infrastructure. Production systems would use larger models (e.g., text-embedding-3-large) or domain-specific fine-tuned models.

Vector Index Design

FAISS IndexFlatIP provides exact nearest-neighbor search with cosine similarity (via L2-normalized embeddings and inner product). No approximate search (ANN) is needed for small document collections (<100 chunks). Production systems with larger corpora would use HNSW or IVF indexes for sub-linear search complexity.
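The equivalence that makes IndexFlatIP return cosine scores is easy to verify with numpy: after L2 normalization, the inner product of two vectors equals their cosine similarity.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# cosine similarity from the definition
cos = float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

# inner product of L2-normalized copies (what IndexFlatIP computes
# once faiss.normalize_L2 has been applied to both sides)
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
ip = float(a_n @ b_n)

assert abs(cos - ip) < 1e-12
```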

LLM Selection and Prompt Engineering

GPT-4o Mini was chosen for cost efficiency ($0.15 per 1M input tokens vs $2.50 for GPT-4o) while maintaining acceptable answer quality. The system prompt explicitly constrains the model to answer only from provided context and refuse out-of-scope questions. This reduces hallucination risk but requires good retrieval quality to avoid "I don't know" responses.

Full Technology Stack

Backend

FastAPI (REST API), pdfplumber (PDF text extraction), sentence-transformers (all-MiniLM-L6-v2 embeddings), FAISS (vector search), OpenAI API (GPT-4o Mini)

Infrastructure

Render (free-tier hosting), in-memory session storage

Frontend

Vanilla JavaScript (no build step)

MLOps/LLMOps

Reference-free evaluation suite (retrieval precision, answer relevance, context coverage), citation tracing, retrieval latency reporting

Technical Challenges and Trade-offs

Challenge: Chunking Strategy

Problem: Fixed-size chunks can split semantic units mid-sentence, degrading retrieval quality.
Solution: Sliding-window chunking with 40-word overlap preserves context across boundaries. Trade-off: each chunk repeats 40 of its 200 words, roughly 25% more storage than the raw text, in exchange for better retrieval quality.
Alternative considered: Semantic chunking (split on paragraph boundaries) β€” rejected due to variable chunk sizes complicating embedding batching and retrieval normalization.

Challenge: Evaluation Without Ground Truth

Problem: Manual labeling of question-answer pairs is expensive and not scalable.
Solution: Reference-free metrics (retrieval precision, answer relevance, context coverage) provide proxy signals for system health without human annotation. Trade-off: Metrics don't capture factual correctness, only correlation and coverage.
Alternative considered: LLM-as-judge evaluation (GPT-4 evaluates answer quality) β€” rejected due to cost ($2.50 per 1M tokens) and latency (adds 2-5s per evaluation).

Challenge: Cold Start on Free-Tier Hosting

Problem: Render free tier spins down services after 15 minutes of inactivity, causing 30-60s cold starts.
Solution: Health check with retry logic (3 attempts, 3s delay) and user-facing status indicator. Trade-off: Degraded UX on first request, but acceptable for demo purposes.
Alternative considered: Keep-alive pings to prevent spin-down β€” rejected to stay within free-tier limits and avoid unnecessary API usage.
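The retry logic amounts to a bounded probe loop. A sketch in Python (the real check runs in the browser frontend; `wait_for_backend` and `probe` are hypothetical names):

```python
import time

def wait_for_backend(probe, attempts=3, delay=3.0):
    """Retry a health probe to ride out the cold start: `probe` is any
    zero-arg callable returning True once the backend is up (in the real
    UI this would be a fetch of the health endpoint)."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            time.sleep(delay)  # back off before the next attempt
    return False

# Simulate a service that comes up on the second probe:
state = {"calls": 0}
def fake_probe():
    state["calls"] += 1
    return state["calls"] >= 2

ok = wait_for_backend(fake_probe, attempts=3, delay=0.0)
```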

Challenge: In-Memory Storage Constraints

Problem: FAISS index and document chunks stored in memory, limiting concurrent users and document sizes.
Solution: Session-based storage with 5000-word document limit. Trade-off: No persistence across restarts, single-user sessions only. Production systems would use vector databases (Pinecone, Weaviate, Qdrant) with persistent storage.
Alternative considered: SQLite + FAISS serialization β€” rejected due to deployment complexity and filesystem write permissions on Render free tier.
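The session store itself amounts to a module-level dict keyed by session ID; a sketch with assumed field names:

```python
import uuid

# {session_id: {"index": <FAISS index>, "chunks": [...]}} lives only in RAM
SESSIONS: dict[str, dict] = {}

def create_session(index, chunks):
    """Register a freshly built index under a new random session ID."""
    sid = uuid.uuid4().hex
    SESSIONS[sid] = {"index": index, "chunks": chunks}
    return sid

def get_session(sid):
    """Look up a session; returns None for unknown IDs (and for every
    ID after a restart, since nothing is persisted)."""
    return SESSIONS.get(sid)

sid = create_session("fake-index", ["chunk one", "chunk two"])
```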

Technology Selection Rationale

Why FastAPI?

Automatic OpenAPI documentation (critical for API debugging), native async support (required for concurrent OpenAI API calls), and Pydantic validation (type-safe request/response models). Alternative (Flask) lacks first-class async support and automatic docs. Alternative (Django) is overkill for a lightweight stateless API.

Why FAISS?

Industry-standard vector search library with battle-tested performance. Supports exact and approximate search algorithms. Alternatives (Annoy, ScaNN) have narrower feature sets or less Python support. Vector databases (Pinecone, Weaviate) are overkill for in-memory demo use case.

Why GPT-4o Mini?

Cost efficiency ($0.15 per 1M input tokens) with acceptable answer quality. GPT-4o ($2.50 per 1M tokens) provides marginal quality improvement at 16x cost. Open-source alternatives (Llama 3, Mistral) require GPU infrastructure and self-hosting complexity.

Why Vanilla JavaScript?

Zero build step (instant deploys), faster page load (no framework bundle), easier debugging (no transpilation). React/Vue would add complexity without meaningful UX benefit for this interactive demo.

Demo Limitations

Documents are capped at 5000 words, with one document per session.
Storage is in-memory only; indexes and chunks do not persist across restarts.
Sessions are single-user.
On free-tier hosting the service spins down after inactivity, so the first request can take 30-60s.

Architecture Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        CLIENT (Browser)                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚  β”‚ PDF Upload   β”‚  β”‚ Chat UI      β”‚  β”‚ Benchmark UI β”‚           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚                  β”‚                  β”‚
          β”‚ POST /ingest     β”‚ POST /chat       β”‚ POST /eval
          β”‚ (multipart/form) β”‚ (JSON)           β”‚ (JSON)
          β–Ό                  β–Ό                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    FASTAPI BACKEND (Render)                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Routers: /api/v1/ingest, /api/v1/chat, /api/v1/eval      β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚               β”‚                  β”‚             β”‚                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ PDF Parser       β”‚  β”‚ Retriever       β”‚  β”‚ Evaluator      β”‚  β”‚
β”‚  β”‚ (pdfplumber)     β”‚  β”‚ (FAISS search)  β”‚  β”‚ (metrics calc) β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚           β”‚                      β”‚                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”                     β”‚
β”‚  β”‚ Chunker          β”‚  β”‚ GPT-4o Mini      β”‚                     β”‚
β”‚  β”‚ (sliding window) β”‚  β”‚ (OpenAI API)     β”‚                     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β”‚
β”‚           β”‚                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚  β”‚ Embedder (sentence-transformers) β”‚                           β”‚
β”‚  β”‚ Model: all-MiniLM-L6-v2          β”‚                           β”‚
β”‚  β”‚ FAISS IndexFlatIP (cosine sim)   β”‚                           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚  β”‚ Session Store (in-memory)        β”‚                           β”‚
β”‚  β”‚ {session_id: {index, chunks, …}} β”‚                           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜