🚀 Complete RAG System Architecture

Deployed on Render • Optional Redis Caching

📘 Simple RAG Architecture

Best for: documents under 50 pages, fast prototyping

1. Document Loading: PDF → clean text (PyPDF2)
2. Chunking: 2,000-word chunks with 200-word overlap
3. Long Context: all chunks → Claude Sonnet 4.5
4. Answer Generation: Claude (200K-token context window)

Latency: 2-3 s • Cost: $15 per 1K queries • Precision: 85% • Setup time: 5 min
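
The whole simple pipeline fits in a few functions. A minimal sketch, assuming the PyPDF2 and anthropic packages; the model id and prompt wording are placeholders, and rag_app.py may differ in detail:

```python
from PyPDF2 import PdfReader
import anthropic

def load_pdf(path: str) -> str:
    """Step 1: PDF -> clean text."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_words(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Step 2: fixed-size word chunks with overlap between neighbors."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def answer(question: str, chunks: list[str]) -> str:
    """Steps 3-4: pack every chunk into one long-context prompt for Claude."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    context = "\n\n---\n\n".join(chunks)
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.content[0].text
```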
🚀 Enhanced RAG + Cache

Best for: large documents, multi-document collections, high query volume (new: Redis caching)

0. ⚡ Cache Check (optional): Redis → instant answer if cached (<10 ms, $0.00)
   ↓ (on cache miss)
1. Smart Processing: PDF → structured chunks (800 words)
2. Embeddings: Cohere (1024-dim vectors)
3. Vector Storage: Pinecone (cosine similarity)
4. Retrieval: query → top-20 candidates
5. Re-ranking: Cohere → top-5 best chunks
6. Generation: Claude Sonnet 4.5
7. ⚡ Cache Store: save the answer to Redis (1-hour TTL)

Cached latency: <10 ms • Cost: $3.50 per 1K queries (at an 80% cache-hit rate) • Precision: 95% • Cost savings: 80%
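
End to end, steps 0-7 look roughly like this. A minimal sketch, assuming the cohere, pinecone, redis, and anthropic clients; the index name ("rag-chunks"), cache-key scheme, and model id are illustrative assumptions, not the exact rag_app_enhanced.py code:

```python
import hashlib
import anthropic, cohere, redis
from pinecone import Pinecone

co = cohere.Client()                        # API keys are read from the environment
pc = Pinecone()
index = pc.Index("rag-chunks")              # hypothetical index name
cache = redis.Redis(decode_responses=True)
claude = anthropic.Anthropic()

def ask(question: str) -> str:
    # Step 0: cache check -- a hit returns instantly and costs nothing.
    key = "rag:" + hashlib.sha256(question.encode()).hexdigest()
    if (hit := cache.get(key)) is not None:
        return hit

    # Step 2: embed the query (chunks were embedded the same way at index time).
    vec = co.embed(texts=[question], model="embed-multilingual-v3.0",
                   input_type="search_query").embeddings[0]

    # Steps 3-4: cosine-similarity search in Pinecone, top-20 candidates.
    matches = index.query(vector=vec, top_k=20, include_metadata=True).matches
    candidates = [m.metadata["text"] for m in matches]

    # Step 5: Cohere re-ranking down to the 5 best chunks.
    ranked = co.rerank(model="rerank-multilingual-v3.0", query=question,
                       documents=candidates, top_n=5)
    context = "\n\n".join(candidates[r.index] for r in ranked.results)

    # Step 6: generation over the re-ranked context only.
    resp = claude.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    answer = resp.content[0].text

    # Step 7: store with a 1-hour TTL so repeat queries are free until expiry.
    cache.setex(key, 3600, answer)
    return answer
```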

🔧 System Components

  • 📄 rag_app.py: Simple RAG; auto-loads .env, needs a single API key
  • 🚀 rag_app_enhanced.py: Enhanced RAG with optional Redis cache
  • 🎯 qa_generator.py: generates evaluation datasets (50-100 Q&A pairs, 5 question types)
  • 📊 rag_evaluator.py: 4-metric evaluation (faithfulness, answer relevancy, context relevancy, correctness)
  • 💰 cost_calculator.py: real cost analysis (Simple vs. Enhanced vs. Enhanced + Cache)
  • 🐳 Docker: production-ready Dockerfile + docker-compose

Comparison Matrix

| Feature             | Simple RAG          | Enhanced RAG        | Enhanced + Cache          |
|---------------------|---------------------|---------------------|---------------------------|
| Document size       | < 50 pages          | Unlimited           | Unlimited                 |
| Setup complexity    | Easy (1 API)        | Moderate (3 APIs)   | Moderate (3 APIs + Redis) |
| Multi-document      | No                  | Yes                 | Yes                       |
| Precision           | 85%                 | 95%                 | 95%                       |
| Cost per 1K queries | $15                 | $17.50              | $3.50 🎯                  |
| Latency             | 2-3 s               | 2.5-3.5 s           | <10 ms (cached)           |
| Repeated queries    | Full cost each time | Full cost each time | Free from cache           |
| Evaluation support  | Yes                 | Yes                 | Yes                       |
| Best for            | Prototypes          | Production          | High-volume production    |
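
The $3.50 cell is blended pricing: at an 80% hit rate, only one query in five pays the full Enhanced price, and cache hits cost effectively nothing.

```python
full_cost = 17.50                       # Enhanced RAG, $ per 1K queries
hit_rate = 0.80                         # assumed cache-hit rate
effective = (1 - hit_rate) * full_cost  # misses pay full price, hits are ~free
print(f"${effective:.2f} per 1K queries")  # -> $3.50
```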

📊 Evaluation Pipeline

Automated ground-truth generation and a 4-metric evaluation system

1️⃣ Q&A Generator

qa_generator.py

  • Auto-generate Q&A pairs
  • 5 question types (including factual, conceptual, and multi-hop)
  • Difficulty levels (easy, medium, hard)
  • Export to JSON
  • Claude-powered generation (sketched below)
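
The generation step is one structured prompt per chunk. A minimal sketch; the prompt wording and JSON schema here are illustrative assumptions, not the exact qa_generator.py prompt:

```python
import json
import anthropic

client = anthropic.Anthropic()

def generate_qa(chunk: str, n: int = 5) -> list[dict]:
    """Ask Claude for n tagged Q&A pairs about one document chunk."""
    prompt = (
        f"From the passage below, write {n} question-answer pairs. "
        "Vary the question type (e.g. factual, conceptual, multi-hop) and tag "
        "each pair with a difficulty of easy, medium, or hard. Return only a "
        'JSON list of objects with keys "question", "answer", "type", '
        '"difficulty".\n\n' + chunk
    )
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; production code should validate.
    return json.loads(resp.content[0].text)
```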

2️⃣ RAG System

Run on test dataset

  • Load evaluation dataset
  • Process each question
  • Collect RAG responses
  • Measure retrieval quality
  • Track latency & costs (harness sketched below)
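
A minimal harness for this step, assuming the dataset is the JSON exported above and a hypothetical rag_answer(question) callable that returns an answer plus its retrieved contexts:

```python
import json
import time

def run_on_dataset(rag_answer, dataset_path: str) -> list[dict]:
    """Replay the evaluation set through a RAG system, recording results."""
    with open(dataset_path) as f:
        dataset = json.load(f)
    records = []
    for item in dataset:
        start = time.perf_counter()
        answer, contexts = rag_answer(item["question"])  # hypothetical signature
        records.append({
            "question": item["question"],
            "ground_truth": item["answer"],
            "rag_answer": answer,
            "contexts": contexts,
            "latency_s": time.perf_counter() - start,
        })
    return records
```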

3️⃣ Evaluator

rag_evaluator.py

  • 4-metric evaluation
  • LLM-as-judge scoring (sketched below)
  • Detailed breakdowns
  • Comparative analysis
  • Export results
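
LLM-as-judge scoring means asking Claude to grade each record against a rubric. A minimal sketch for one metric (faithfulness); the other three metrics swap in different rubric prompts, and the exact wording in rag_evaluator.py may differ:

```python
import anthropic

client = anthropic.Anthropic()

def judge_faithfulness(answer: str, contexts: list[str]) -> float:
    """Score 0-1 for how well the answer is grounded in the retrieved context."""
    prompt = (
        "Rate from 0 to 10 how fully the ANSWER is supported by the CONTEXT. "
        "Reply with the number only.\n\n"
        "CONTEXT:\n" + "\n\n".join(contexts) +
        "\n\nANSWER:\n" + answer
    )
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.content[0].text.strip()) / 10  # normalize to 0-1
```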

Evaluation Metrics

  • Faithfulness: the answer is supported by the retrieved context
  • 🎯 Answer Relevancy: the answer directly addresses the question
  • 📚 Context Relevancy: retrieval quality (the right chunks were retrieved)
  • Correctness: the answer matches the ground-truth answer

Evaluation Workflow

📝 Generate (50-100 Q&A pairs) → 🤖 Run RAG (test both systems) → 📊 Evaluate (4-metric scoring) → Improve (iterate on weak areas)

🌐 Production Deployment

🚀 Render Deployment

  • Live on Render.com
  • 4 apps running
  • Auto-deploy from GitHub
  • Environment variables configured
  • Optional Redis add-on

🐳 Docker Setup

  • Production Dockerfile
  • docker-compose.yml
  • Redis container included
  • Multi-app orchestration
  • Health checks configured

🔐 Configuration

  • Auto-load from .env
  • Secure API key management
  • Optional cache (USE_CACHE)
  • .gitignore for secrets
  • Environment-based config (sketched below)
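
A minimal sketch of the pattern, assuming python-dotenv; variable names other than USE_CACHE are assumptions:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # auto-load .env so secrets never live in code (.env is git-ignored)

ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]

# Optional cache toggle: the app runs fine without Redis.
USE_CACHE = os.getenv("USE_CACHE", "false").lower() == "true"
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
```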

Tech Stack

  • 🤖 LLM: Claude Sonnet 4.5 (Anthropic)
  • 🔢 Embeddings: Cohere embed-multilingual-v3.0
  • 💾 Vector DB: Pinecone (serverless)
  • 🎯 Re-ranking: Cohere rerank-multilingual-v3.0
  • ⚡ Cache: Redis 7 (optional)
  • 🖥️ Framework: Streamlit + Python 3.11