🚀 Complete RAG System Architecture

Deployed on Render • Optional Redis Caching

📘 Simple RAG Architecture

Best for: documents under 50 pages, fast prototyping

1. Document Loading: PDF → clean text (PyPDF2)
2. Chunking: 2,000-word chunks with 200-word overlap
3. Long Context: all chunks → Claude Sonnet 4.5
4. Answer Generation: Claude (200K-token context window)

Latency: 2-3 s • Cost: $15 per 1K queries • Precision: 85% • Setup time: 5 min
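
The whole simple pipeline fits in a few functions. A minimal sketch, assuming the PyPDF2 and anthropic packages; the model id and prompt wording are placeholders, and rag_app.py may differ in detail:

```python
from PyPDF2 import PdfReader
import anthropic

def load_pdf(path: str) -> str:
    """Step 1: PDF -> clean text."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_words(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Step 2: fixed-size word chunks with overlap between neighbors."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def answer(question: str, chunks: list[str]) -> str:
    """Steps 3-4: pack every chunk into one long-context prompt for Claude."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    context = "\n\n---\n\n".join(chunks)
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.content[0].text
```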
🚀 Enhanced RAG + Cache

Best for: large documents, multi-document collections, high query volume (new: Redis caching)

0. ⚡ Cache Check (optional): Redis → instant answer if cached (<10 ms, $0.00)
   ↓ (on cache miss)
1. Smart Processing: PDF → structured chunks (800 words)
2. Embeddings: Cohere (1024-dim vectors)
3. Vector Storage: Pinecone (cosine similarity)
4. Retrieval: query → top-20 candidates
5. Re-ranking: Cohere → top-5 best chunks
6. Generation: Claude Sonnet 4.5
7. ⚡ Cache Store: save the answer to Redis (1-hour TTL)

Cached latency: <10 ms • Cost: $3.50 per 1K queries (at an 80% cache-hit rate) • Precision: 95% • Cost savings: 80%
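
End to end, steps 0-7 look roughly like this. A minimal sketch, assuming the cohere, pinecone, redis, and anthropic clients; the index name ("rag-chunks"), cache-key scheme, and model id are illustrative assumptions, not the exact rag_app_enhanced.py code:

```python
import hashlib
import anthropic, cohere, redis
from pinecone import Pinecone

co = cohere.Client()                        # API keys are read from the environment
pc = Pinecone()
index = pc.Index("rag-chunks")              # hypothetical index name
cache = redis.Redis(decode_responses=True)
claude = anthropic.Anthropic()

def ask(question: str) -> str:
    # Step 0: cache check -- a hit returns instantly and costs nothing.
    key = "rag:" + hashlib.sha256(question.encode()).hexdigest()
    if (hit := cache.get(key)) is not None:
        return hit

    # Step 2: embed the query (chunks were embedded the same way at index time).
    vec = co.embed(texts=[question], model="embed-multilingual-v3.0",
                   input_type="search_query").embeddings[0]

    # Steps 3-4: cosine-similarity search in Pinecone, top-20 candidates.
    matches = index.query(vector=vec, top_k=20, include_metadata=True).matches
    candidates = [m.metadata["text"] for m in matches]

    # Step 5: Cohere re-ranking down to the 5 best chunks.
    ranked = co.rerank(model="rerank-multilingual-v3.0", query=question,
                       documents=candidates, top_n=5)
    context = "\n\n".join(candidates[r.index] for r in ranked.results)

    # Step 6: generation over the re-ranked context only.
    resp = claude.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    answer = resp.content[0].text

    # Step 7: store with a 1-hour TTL so repeat queries are free until expiry.
    cache.setex(key, 3600, answer)
    return answer
```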

🔧 System Components

  • 📄 rag_app.py: Simple RAG; auto-loads .env, needs a single API key
  • 🚀 rag_app_enhanced.py: Enhanced RAG with optional Redis cache
  • 🎯 qa_generator.py: generates evaluation datasets (50-100 Q&A pairs, 5 question types)
  • 📊 rag_evaluator.py: 4-metric evaluation (faithfulness, answer relevancy, context relevancy, correctness)
  • 💰 cost_calculator.py: real cost analysis (Simple vs. Enhanced vs. Enhanced + Cache)
  • 🐳 Docker: production-ready Dockerfile + docker-compose

Comparison Matrix

| Feature             | Simple RAG          | Enhanced RAG        | Enhanced + Cache          |
|---------------------|---------------------|---------------------|---------------------------|
| Document size       | < 50 pages          | Unlimited           | Unlimited                 |
| Setup complexity    | Easy (1 API)        | Moderate (3 APIs)   | Moderate (3 APIs + Redis) |
| Multi-document      | No                  | Yes                 | Yes                       |
| Precision           | 85%                 | 95%                 | 95%                       |
| Cost per 1K queries | $15                 | $17.50              | $3.50 🎯                  |
| Latency             | 2-3 s               | 2.5-3.5 s           | <10 ms (cached)           |
| Repeated queries    | Full cost each time | Full cost each time | Free from cache           |
| Evaluation support  | Yes                 | Yes                 | Yes                       |
| Best for            | Prototypes          | Production          | High-volume production    |
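
The $3.50 cell is blended pricing: at an 80% hit rate, only one query in five pays the full Enhanced price, and cache hits cost effectively nothing.

```python
full_cost = 17.50                       # Enhanced RAG, $ per 1K queries
hit_rate = 0.80                         # assumed cache-hit rate
effective = (1 - hit_rate) * full_cost  # misses pay full price, hits are ~free
print(f"${effective:.2f} per 1K queries")  # -> $3.50
```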

📊 Evaluation Pipeline

Automated ground-truth generation and a 4-metric evaluation system

1️⃣ Q&A Generator

qa_generator.py

  • Auto-generate Q&A pairs
  • 5 question types (including factual, conceptual, and multi-hop)
  • Difficulty levels (easy, medium, hard)
  • Export to JSON
  • Claude-powered generation (sketched below)
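
The generation step is one structured prompt per chunk. A minimal sketch; the prompt wording and JSON schema here are illustrative assumptions, not the exact qa_generator.py prompt:

```python
import json
import anthropic

client = anthropic.Anthropic()

def generate_qa(chunk: str, n: int = 5) -> list[dict]:
    """Ask Claude for n tagged Q&A pairs about one document chunk."""
    prompt = (
        f"From the passage below, write {n} question-answer pairs. "
        "Vary the question type (e.g. factual, conceptual, multi-hop) and tag "
        "each pair with a difficulty of easy, medium, or hard. Return only a "
        'JSON list of objects with keys "question", "answer", "type", '
        '"difficulty".\n\n' + chunk
    )
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; production code should validate.
    return json.loads(resp.content[0].text)
```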

2️⃣ RAG System

Run on test dataset

  • Load evaluation dataset
  • Process each question
  • Collect RAG responses
  • Measure retrieval quality
  • Track latency & costs (harness sketched below)
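
A minimal harness for this step, assuming the dataset is the JSON exported above and a hypothetical rag_answer(question) callable that returns an answer plus its retrieved contexts:

```python
import json
import time

def run_on_dataset(rag_answer, dataset_path: str) -> list[dict]:
    """Replay the evaluation set through a RAG system, recording results."""
    with open(dataset_path) as f:
        dataset = json.load(f)
    records = []
    for item in dataset:
        start = time.perf_counter()
        answer, contexts = rag_answer(item["question"])  # hypothetical signature
        records.append({
            "question": item["question"],
            "ground_truth": item["answer"],
            "rag_answer": answer,
            "contexts": contexts,
            "latency_s": time.perf_counter() - start,
        })
    return records
```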

3️⃣ Evaluator

rag_evaluator.py

  • 4-metric evaluation
  • LLM-as-judge scoring (sketched below)
  • Detailed breakdowns
  • Comparative analysis
  • Export results
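
LLM-as-judge scoring means asking Claude to grade each record against a rubric. A minimal sketch for one metric (faithfulness); the other three metrics swap in different rubric prompts, and the exact wording in rag_evaluator.py may differ:

```python
import anthropic

client = anthropic.Anthropic()

def judge_faithfulness(answer: str, contexts: list[str]) -> float:
    """Score 0-1 for how well the answer is grounded in the retrieved context."""
    prompt = (
        "Rate from 0 to 10 how fully the ANSWER is supported by the CONTEXT. "
        "Reply with the number only.\n\n"
        "CONTEXT:\n" + "\n\n".join(contexts) +
        "\n\nANSWER:\n" + answer
    )
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.content[0].text.strip()) / 10  # normalize to 0-1
```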

Evaluation Metrics

  • Faithfulness: the answer is supported by the retrieved context
  • 🎯 Answer Relevancy: the answer directly addresses the question
  • 📚 Context Relevancy: retrieval quality (the right chunks were retrieved)
  • Correctness: the answer matches the ground-truth answer

Evaluation Workflow

📝 Generate (50-100 Q&A pairs) → 🤖 Run RAG (test both systems) → 📊 Evaluate (4-metric scoring) → Improve (iterate on weak areas)

🌐 Production Deployment

🚀 Render Deployment

  • Live on Render.com
  • 4 apps running
  • Auto-deploy from GitHub
  • Environment variables configured
  • Optional Redis add-on

🐳 Docker Setup

  • Production Dockerfile
  • docker-compose.yml
  • Redis container included
  • Multi-app orchestration
  • Health checks configured

🔐 Configuration

  • Auto-load from .env
  • Secure API key management
  • Optional cache (USE_CACHE)
  • .gitignore for secrets
  • Environment-based config (sketched below)
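
A minimal sketch of the pattern, assuming python-dotenv; variable names other than USE_CACHE are assumptions:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # auto-load .env so secrets never live in code (.env is git-ignored)

ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]

# Optional cache toggle: the app runs fine without Redis.
USE_CACHE = os.getenv("USE_CACHE", "false").lower() == "true"
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
```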

Tech Stack

  • 🤖 LLM: Claude Sonnet 4.5 (Anthropic)
  • 🔢 Embeddings: Cohere embed-multilingual-v3.0
  • 💾 Vector DB: Pinecone (serverless)
  • 🎯 Re-ranking: Cohere rerank-multilingual-v3.0
  • ⚡ Cache: Redis 7 (optional)
  • 🖥️ Framework: Streamlit + Python 3.11