PDF Knowledge Demo
RAG Copilot: Document Q&A with Citations
Upload a PDF document (max 5000 words), build a vector index, and chat over your document with full citation tracing. Run automated benchmarks to evaluate retrieval precision, answer relevance, and context coverage.
1. Upload & Index PDF
Select a PDF file (max 5000 words). The system will extract text, chunk it with sliding windows, and build a FAISS vector index.
4. What This Project Does
This is a Retrieval-Augmented Generation (RAG) system that lets you upload a PDF document and ask questions about it in natural language. Instead of searching for keywords, it understands the meaning of your question and finds relevant sections of your document to answer it.
The system solves the problem of extracting information from long documents. Rather than reading an entire PDF to find an answer, you can ask specific questions and get precise responses with references to the exact sections where the information came from.
This project demonstrates three capabilities: semantic search (finding relevant information based on meaning, not just keywords), question answering (generating accurate responses from document content), and automated evaluation (measuring how well the system performs without manual review).
The system includes guardrails to ensure answers come only from the uploaded document, provides citations so you can verify every answer, and includes an automated benchmark suite to measure retrieval quality, answer relevance, and index coverage.
5. Technical Deep-Dive
Purpose and Scope
This system implements a production-grade RAG pipeline for single-document question answering with citation tracing and automated evaluation. The scope is constrained to PDF documents under 5000 words, single-user sessions, and in-memory storage to keep the demo lightweight and deployable on free-tier infrastructure.
Architecture and Design Decisions
The architecture separates concerns into three distinct pipelines: ingest (document processing and indexing), retrieval (semantic search and answer generation), and evaluation (automated quality metrics). This design enables independent testing and optimization of each component. The system uses a stateless REST API with session-based storage, allowing horizontal scaling while maintaining simplicity.
Data Flow and Execution Model
Ingest Pipeline
PDF → Text extraction (pdfplumber) → Word count validation (5000-word limit) → Sliding-window chunking (200 words per chunk, 40-word overlap to preserve context across boundaries) → Sentence embedding (sentence-transformers with the all-MiniLM-L6-v2 model) → FAISS IndexFlatIP (cosine similarity on L2-normalized embeddings) → In-memory session store.
RAG Pipeline
User query → Sentence embedding → FAISS similarity search (top-4 chunks) → Context injection into GPT-4o Mini prompt → System prompt enforces answer-from-context-only constraint → Response with citations (chunk IDs, text snippets, cosine scores, retrieval latency).
Evaluation Pipeline
Generate synthetic test questions (extract first sentence from random chunks) → Run RAG pipeline on each question → Compute three reference-free metrics: retrieval precision (average top-1 cosine score), answer relevance (cosine similarity between question and answer embeddings), context coverage (fraction of unique chunks retrieved across all queries).
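The three metrics above can be sketched as a small pure-Python function. The field names (`top1_score`, `qa_similarity`, `chunk_ids`) are illustrative, not the project's actual schema; the math matches the description.

```python
def reference_free_metrics(results, n_chunks):
    """Compute the three proxy metrics from per-query results.

    Each result dict is assumed to carry: the top-1 retrieval cosine
    score, the cosine similarity between question and answer
    embeddings, and the set of chunk IDs retrieved for that query.
    """
    # Retrieval precision: mean top-1 cosine score across queries
    retrieval_precision = sum(r["top1_score"] for r in results) / len(results)
    # Answer relevance: mean question-answer embedding similarity
    answer_relevance = sum(r["qa_similarity"] for r in results) / len(results)
    # Context coverage: fraction of distinct chunks ever retrieved
    retrieved = set().union(*(r["chunk_ids"] for r in results))
    context_coverage = len(retrieved) / n_chunks
    return retrieval_precision, answer_relevance, context_coverage
```

Because all three are reference-free, the benchmark can run on any freshly uploaded document with no labeled question-answer pairs.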
Implementation Details
Chunking Strategy
Sliding-window chunking with 200-word chunks and 40-word overlap ensures that semantic units (paragraphs, concepts) are not split across chunk boundaries, improving retrieval quality. Fixed-size chunks simplify embedding and enable consistent retrieval latency.
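A minimal sketch of the sliding-window strategy (the production chunker's signature may differ): the stride is `chunk_size - overlap`, so the last 40 words of each chunk reappear at the start of the next.

```python
def sliding_window_chunks(words, chunk_size=200, overlap=40):
    """Split a list of words into fixed-size, overlapping chunks."""
    stride = chunk_size - overlap  # 160 words of new content per chunk
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final window already covers the tail
    return chunks
```

A 400-word document yields three chunks, with each boundary covered by a 40-word overlap, so a sentence straddling a boundary still appears whole in at least one chunk.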
Embedding Model Selection
all-MiniLM-L6-v2 balances quality and speed (384-dimensional embeddings, ~22.7M parameters, ~120ms inference on 2-core CPU). Chosen for acceptable semantic similarity performance without requiring GPU infrastructure. Production systems would use larger models (e.g., text-embedding-3-large) or domain-specific fine-tuned models.
Vector Index Design
FAISS IndexFlatIP provides exact nearest-neighbor search with cosine similarity (via L2-normalized embeddings and inner product). No approximate search (ANN) is needed for small document collections (<100 chunks). Production systems with larger corpora would use HNSW or IVF indexes for sub-linear search complexity.
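The key identity here is that the inner product of L2-normalized vectors equals their cosine similarity. A NumPy sketch of the same exact search (NumPy stands in for faiss so the example has no extra dependency; in the real pipeline these calls map to `faiss.normalize_L2` and `IndexFlatIP.search`):

```python
import numpy as np

def cosine_top_k(index_vecs, query_vec, k=4):
    """Exact nearest-neighbor cosine search, as IndexFlatIP does it."""
    # L2-normalize both sides so inner product == cosine similarity
    idx = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = idx @ q                 # one inner product per chunk
    order = np.argsort(-scores)[:k]  # best-first, top-k
    return order, scores[order]
```

For fewer than ~100 chunks this brute-force scan is microseconds of work, which is why no ANN index is needed.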
LLM Selection and Prompt Engineering
GPT-4o Mini was chosen for cost efficiency ($0.15 per 1M input tokens vs $2.50 for GPT-4o) while maintaining acceptable answer quality. The system prompt explicitly constrains the model to answer only from provided context and refuse out-of-scope questions. This reduces hallucination risk but requires good retrieval quality to avoid "I don't know" responses.
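The context-injection step can be sketched as follows. The prompt wording is illustrative, not the project's exact prompt; the pattern (retrieved chunks in the system message, strict answer-from-context instruction) matches the description above.

```python
def build_messages(context_chunks, question):
    """Assemble a chat request that constrains the model to the
    retrieved context. context_chunks is a list of (chunk_id, text)."""
    context = "\n\n".join(
        f"[chunk {cid}] {text}" for cid, text in context_chunks
    )
    system = (
        "Answer ONLY from the context below. If the answer is not in "
        "the context, say you don't know. Cite the chunk IDs you used.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The resulting message list is what gets passed to the OpenAI chat completions API; tagging each chunk with its ID is what lets the model emit verifiable citations.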
Full Technology Stack
Backend
- Python 3.11: Language runtime (async support, type hints, performance)
- FastAPI 0.104: Web framework (automatic OpenAPI docs, async, validation with Pydantic)
- pdfplumber 0.10: PDF text extraction (preserves layout, handles complex PDFs)
- sentence-transformers 2.2: Embedding generation (wraps HuggingFace models)
- FAISS 1.7: Vector similarity search (Facebook AI Similarity Search library)
- OpenAI Python SDK 1.3: GPT-4o Mini API client
- Uvicorn 0.24: ASGI server (production-ready, async, HTTP/1.1 and WebSockets)
- python-multipart 0.0.6: Multipart form data parsing (file uploads)
Infrastructure
- Docker: Containerization for reproducible builds and deployment
- Render: Backend hosting (free tier, auto-scaling, zero-config HTTPS, environment variables for secrets)
- Cloudflare Pages: Frontend hosting (global CDN, instant deploys, automatic HTTPS)
Frontend
- Vanilla JavaScript: No framework dependencies (faster load, easier debugging, no build step)
- HTML5 + CSS3: Native file upload API, Fetch API for HTTP, Canvas API for charts
MLOps/LLMOps
- Reference-free evaluation metrics: No human labeling required for quality assessment
- Session-based state management: Isolated user sessions, no shared state pollution
- Citation tracing: Every answer includes provenance (chunk IDs, scores, latency)
- Environment-based configuration: API keys injected via environment variables, never committed to source control
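The environment-based configuration point can be made concrete with a fail-fast loader. Variable names besides `OPENAI_API_KEY` are hypothetical:

```python
import os

def load_config():
    """Read secrets and limits from the environment, failing fast
    if a required secret is missing rather than at first API call."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "openai_api_key": api_key,
        # MAX_DOC_WORDS is a hypothetical override for the 5000-word cap
        "max_words": int(os.environ.get("MAX_DOC_WORDS", "5000")),
    }
```

On Render the variables are set in the dashboard, so the same image runs unmodified in every environment and no secret ever enters the repository.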
Technical Challenges and Trade-offs
Challenge: Chunking Strategy
Problem: Fixed-size chunks can split semantic units mid-sentence, degrading retrieval quality.
Solution: Sliding-window chunking with 40-word overlap preserves context across boundaries. Trade-off: roughly 25% storage overhead (with a 160-word stride, each word is stored in about 1.25 chunks on average) for better retrieval quality.
Alternative considered: Semantic chunking (split on paragraph boundaries); rejected because variable chunk sizes complicate embedding batching and retrieval normalization.
Challenge: Evaluation Without Ground Truth
Problem: Manual labeling of question-answer pairs is expensive and not scalable.
Solution: Reference-free metrics (retrieval precision, answer relevance, context coverage) provide proxy signals for system health without human annotation. Trade-off: Metrics don't capture factual correctness, only correlation and coverage.
Alternative considered: LLM-as-judge evaluation (GPT-4 evaluates answer quality); rejected due to cost ($2.50 per 1M tokens) and latency (adds 2-5s per evaluation).
Challenge: Cold Start on Free-Tier Hosting
Problem: Render free tier spins down services after 15 minutes of inactivity, causing 30-60s cold starts.
Solution: Health check with retry logic (3 attempts, 3s delay) and user-facing status indicator. Trade-off: Degraded UX on first request, but acceptable for demo purposes.
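The demo implements this retry loop in frontend JavaScript; an equivalent Python sketch of the pattern (3 attempts, 3s delay, assuming a `/health` endpoint that returns 200 when warm):

```python
import time
import urllib.request
import urllib.error

def wait_for_backend(url, attempts=3, delay=3.0):
    """Poll a health endpoint until it answers, absorbing the
    free-tier cold start before the first real request is sent."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True  # backend is warm
        except (urllib.error.URLError, TimeoutError):
            pass  # still spinning up; fall through to retry
        if attempt < attempts:
            time.sleep(delay)
    return False
```

While the loop runs, the UI shows a "waking up the server" status instead of a raw network error, which is the user-facing half of the mitigation.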
Alternative considered: Keep-alive pings to prevent spin-down; rejected to stay within free-tier limits and avoid unnecessary API usage.
Challenge: In-Memory Storage Constraints
Problem: FAISS index and document chunks stored in memory, limiting concurrent users and document sizes.
Solution: Session-based storage with 5000-word document limit. Trade-off: No persistence across restarts, single-user sessions only. Production systems would use vector databases (Pinecone, Weaviate, Qdrant) with persistent storage.
Alternative considered: SQLite + FAISS serialization; rejected due to deployment complexity and filesystem write permissions on Render free tier.
Technology Selection Rationale
Why FastAPI?
Automatic OpenAPI documentation (critical for API debugging), native async support (required for OpenAI API calls), and Pydantic validation (type-safe request/response models). Alternative (Flask) lacks async and automatic docs. Alternative (Django) is overkill for stateless API.
Why FAISS?
Industry-standard vector search library with battle-tested performance. Supports exact and approximate search algorithms. Alternatives (Annoy, ScaNN) have narrower feature sets or less Python support. Vector databases (Pinecone, Weaviate) are overkill for in-memory demo use case.
Why GPT-4o Mini?
Cost efficiency ($0.15 per 1M input tokens) with acceptable answer quality. GPT-4o ($2.50 per 1M tokens) provides marginal quality improvement at 16x cost. Open-source alternatives (Llama 3, Mistral) require GPU infrastructure and self-hosting complexity.
Why Vanilla JavaScript?
Zero build step (instant deploys), faster page load (no framework bundle), easier debugging (no transpilation). React/Vue would add complexity without meaningful UX benefit for this interactive demo.
Demo Limitations
- No persistence: Index and chunks stored in-memory, lost on server restart
- 5000-word limit: Prevents memory exhaustion on free-tier infrastructure
- Single-user sessions: No authentication or multi-tenancy support
- Reference-free metrics: Evaluation proxies don't measure factual correctness
- Cold start latency: 30-60s delay on first request after 15-minute idle period
- No production monitoring: No structured logging, error tracking, or observability
Architecture Diagram
┌───────────────────────────────────────────────────────────────────┐
│                          CLIENT (Browser)                         │
│   ┌──────────────┐      ┌──────────────┐      ┌──────────────┐    │
│   │  PDF Upload  │      │   Chat UI    │      │ Benchmark UI │    │
│   └──────┬───────┘      └──────┬───────┘      └──────┬───────┘    │
└──────────┼─────────────────────┼─────────────────────┼────────────┘
           │                     │                     │
           │ POST /ingest        │ POST /chat          │ POST /eval
           │ (multipart/form)    │ (JSON)              │ (JSON)
           ▼                     ▼                     ▼
┌───────────────────────────────────────────────────────────────────┐
│                      FASTAPI BACKEND (Render)                     │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │     Routers: /api/v1/ingest, /api/v1/chat, /api/v1/eval     │  │
│  └────────────┬───────────────────┬───────────────┬────────────┘  │
│               │                   │               │               │
│  ┌────────────▼──────┐  ┌─────────▼────────┐  ┌───▼────────────┐  │
│  │    PDF Parser     │  │    Retriever     │  │   Evaluator    │  │
│  │   (pdfplumber)    │  │  (FAISS search)  │  │ (metrics calc) │  │
│  └────────┬──────────┘  └─────────┬────────┘  └────────────────┘  │
│           │                       │                               │
│  ┌────────▼──────────┐  ┌─────────▼─────────┐                     │
│  │      Chunker      │  │    GPT-4o Mini    │                     │
│  │  (sliding window) │  │   (OpenAI API)    │                     │
│  └────────┬──────────┘  └───────────────────┘                     │
│           │                                                       │
│  ┌────────▼─────────────────────────┐                             │
│  │ Embedder (sentence-transformers) │                             │
│  │ Model: all-MiniLM-L6-v2          │                             │
│  │ FAISS IndexFlatIP (cosine sim)   │                             │
│  └──────────────────────────────────┘                             │
│                                                                   │
│  ┌──────────────────────────────────┐                             │
│  │ Session Store (in-memory)        │                             │
│  │ {session_id: {index, chunks, …}} │                             │
│  └──────────────────────────────────┘                             │
└───────────────────────────────────────────────────────────────────┘