Skip to content

Architecture Design

Before writing a single line of implementation code, spend 20 minutes on a design document. This is not bureaucracy — it forces you to catch integration problems before they become debugging problems at demo time.

Learning objectives

  • Document a system architecture with enough detail to guide implementation
  • Identify the data flow, decision points, and failure modes before coding
  • Choose the right components for the specific problem, not the most impressive-sounding ones

Design document template

Copy this into a DESIGN.md at the root of your project.

## Project: [Name]

### Problem statement
One sentence: what does this system do, and for whom?

### Data flow
1. User input → [what form?]
2. [First processing step] → [what output?]
3. [Continue until LLM response is returned]
4. Response → user

### Components
| Component | Technology | Why this choice |
|-----------|------------|-----------------|
| LLM API   | gpt-4o-mini | Cost-effective for this task |
| Vector DB | ChromaDB   | Local, no external service needed |
| API layer | FastAPI    | Async, streaming support |
| ...       | ...        | ... |

### Failure modes
- What happens if the LLM API is down? → [fallback]
- What happens if retrieval returns nothing? → [fallback]
- What happens if the response is malformed? → [retry/error]

### Evaluation
- Metric 1: [name] — [how measured] — [target value]
- Metric 2: [name] — [how measured] — [target value]

### Out of scope
- [Things explicitly not built, to prevent scope creep]

Architecture patterns by option

Option A — RAG service architecture

User HTTP request
FastAPI /chat endpoint (async)
Cache check (exact-match, SHA256 key)
    ├── HIT → return cached response (< 5ms)
    └── MISS ↓
Embedding API (text-embedding-3-small)
ChromaDB retrieval (top-5 chunks by cosine similarity)
Prompt assembly (system + retrieved context + user question)
OpenAI chat completion (gpt-4o-mini, stream=True)
SSE token stream → HTTP response
Background task: log to audit trail, update cache

Key decisions: - Exact-match cache before embedding (embedding costs $0.02/M tokens — not free) - Retrieve 5 chunks, not 10 — more chunks dilute the context and increase tokens - Stream the response — latency for long answers is perceptible


Option B — LangGraph agent architecture

User question
Planner node (gpt-4o, structured output)
    → Research plan: list of sub-questions
Researcher node (parallel, asyncio.gather)
    → Tool calls: web search or document retrieval
    → Sources: list of (url, snippet) pairs
Writer node
    → Draft report with citations
Critic node (gpt-4o-mini)
    → quality_score: float (0–1)
    → feedback: str
Conditional router
    ├── quality_score >= 0.8 → END
    └── quality_score < 0.8 and attempts < 3 → Writer node (loop)

State structure:

class ResearchState(TypedDict):
    question: str
    plan: list[str]
    sources: Annotated[list[dict], operator.add]
    draft: str
    quality_score: float
    feedback: str
    attempts: int


Option C — Document intelligence architecture

PDF upload (multipart/form-data)
FastAPI /extract endpoint
Text extraction (PyMuPDF page-by-page)
Chunking (if doc > 8k tokens, chunk + summarize first)
OpenAI function calling
    → Tool: extract_fields(schema)
    → Returns: validated Pydantic model
Validation layer
    → Check required fields present
    → Confidence scoring (field completeness, value plausibility)
Return JSON with confidence scores

Component selection guide

Decision RAG service Agent Doc intelligence
LLM gpt-4o-mini gpt-4o (planning) + mini (worker) gpt-4o-mini
Vector DB ChromaDB ChromaDB or skip Skip
Framework FastAPI LangGraph + FastAPI FastAPI
Caching Exact-match Per-session state Skip
Streaming Yes Optional (stream final report) No
Eval RAGAS faithfulness Draft quality score F1 vs ground truth

Don't choose a component because it sounds impressive

LangGraph is the right choice for stateful multi-step agents. It adds complexity for a simple one-shot RAG pipeline — use a plain async function instead. Match complexity to need.


Mermaid diagram (paste into your README)

flowchart TD
    A[User Question] --> B{Cache Hit?}
    B -->|Yes| C[Return Cached Response]
    B -->|No| D[Embed Query]
    D --> E[Retrieve Top-5 Chunks]
    E --> F[Assemble Prompt]
    F --> G[LLM API - Stream]
    G --> H[SSE Response to Client]
    H --> I[Background: Cache + Log]

01-project-brief | 03-implementation