Interview Questions — RAG Q&A Chatbot
These questions are specifically about the design decisions in this project. Prepare concrete answers citing your implementation choices.
Q1: How did you chunk the documents and why?
Fixed-size chunks of 800 characters with a 150-character overlap. The overlap reduces the chance that an answer is split across a chunk boundary: any sentence shorter than the overlap that straddles a boundary still appears whole in at least one chunk, so retrieval can still find it. Fixed-size chunking is simpler to implement and reason about than semantic chunking; for a portfolio project with well-structured docs, it performs acceptably. For production with dense technical PDFs, paragraph-level semantic chunking (splitting at double newlines, respecting section headers) would likely improve precision.
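The chunking scheme described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `chunk_text` and its parameter names are assumptions, and only the 800/150 sizes come from the answer.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # With the defaults, each chunk starts 650 characters after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the defaults, the last 150 characters of each chunk are repeated as the first 150 characters of the next, which is what keeps a short boundary-straddling sentence intact in the later chunk.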
Q2: Why gpt-4o-mini instead of gpt-4o for this project?
This is a factual Q&A task: the answer is grounded in retrieved context, so the model's role is reading comprehension and citation, not open-ended reasoning. gpt-4o-mini handles this well at roughly an order of magnitude lower cost per token. gpt-4o would be appropriate if the retrieved context were long and complex (requiring multi-step reasoning to synthesize), if the questions involved ambiguity that requires careful interpretation, or if quality evaluations showed gpt-4o-mini failing on specific question types.
Q3: How does your caching work, and what inputs determine the cache key?
The cache key is a SHA-256 hash of a JSON-serialized object containing the question and the n_results parameter. Both are included because the same question with a different number of retrieved chunks could produce a different answer (more chunks means more context and a potentially longer or different answer). Temperature is fixed at 0.0 for all cacheable requests so that outputs are as repeatable as possible; caching a sampled, non-deterministic output would pin one arbitrary sample as the answer for every later request. The TTL is 10 minutes; after that, the cache entry expires and the next request goes to the LLM.
Q4: What are the failure modes of this system and how do you handle them?
Three main failure modes:
- Retrieval failure: the question is about something not in the corpus. Handled by checking len(chunks) == 0 and returning a 404 with a clear message rather than generating a hallucinated answer.
- LLM API failure: OpenAI returns 429 (rate limit) or 5xx. The AsyncOpenAI client has built-in exponential backoff for transient errors. For sustained outages, the appropriate response is to surface the error to the caller, not serve a cached (potentially stale) response.
- Stale index: the documents were updated but the index wasn't re-ingested. The freshest content is in the docs directory but the index still has old chunks. Mitigation: run ingest.py on every document update, or add an incremental re-index that compares file modification times.
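The modification-time comparison mentioned for the stale-index case could look roughly like this. It is a sketch under assumptions: the function name `find_stale_files` is hypothetical, and it assumes the ingest step records each file's modification time (keyed by path relative to the docs directory) when it indexes that file.

```python
from pathlib import Path


def find_stale_files(docs_dir: str, indexed_mtimes: dict[str, float]) -> list[str]:
    """Return relative paths of files that are new or modified since they were indexed."""
    root = Path(docs_dir)
    stale = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # A file never seen before has no recorded mtime, so it always counts as stale.
        if path.stat().st_mtime > indexed_mtimes.get(rel, float("-inf")):
            stale.append(rel)
    return stale
```

Only the returned files would need to be re-chunked and re-embedded. This sketch does not detect deleted files; their chunks would have to be removed from the index separately.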
Q5: How would you scale this from 100 to 10,000 requests/day?
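The current architecture is single-instance with an in-memory cache. 10,000 requests/day averages only about 0.12 requests per second, so raw throughput is not the main concern; the pressure points are state and deployment: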
- In-memory cache: doesn't persist across restarts and doesn't work with multiple instances. Replace with Redis.
- ChromaDB: a local PersistentClient is fine for one instance. At scale with concurrent writes (re-ingestion during traffic), move to Pinecone or a hosted ChromaDB.
- Single FastAPI instance: use Fly.io's auto-scaling or deploy multiple replicas behind a load balancer. Async FastAPI handles concurrent requests well, so one instance with a proper server setup (uvicorn with multiple workers) can handle significant load before needing horizontal scaling. Each worker is a separate process, so an in-memory cache is not shared between workers either.
The OpenAI API is not a bottleneck unless you're exceeding your tier's TPM limit — in which case, add a semaphore to limit concurrent LLM calls or upgrade the tier.
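The semaphore idea can be sketched with asyncio alone. The helper name `call_with_limit`, the limit of 5, and the stand-in `fake_llm_call` are assumptions for illustration; in the real service the wrapped coroutine would be the actual LLM request.

```python
import asyncio

MAX_CONCURRENT_LLM_CALLS = 5  # hypothetical limit; tune to the account's rate limits


async def call_with_limit(semaphore: asyncio.Semaphore, make_call):
    """Run make_call() only when a semaphore slot is free."""
    async with semaphore:
        return await make_call()


async def fake_llm_call(i: int) -> str:
    # Stand-in for the real LLM request, so the sketch runs without network access.
    await asyncio.sleep(0.01)
    return f"answer {i}"


async def main() -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_LLM_CALLS)
    # 20 requests arrive at once, but at most 5 are in flight at any moment.
    tasks = [call_with_limit(semaphore, lambda i=i: fake_llm_call(i)) for i in range(20)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(len(asyncio.run(main())))
```

A semaphore caps concurrency within one process only. With several workers or replicas, each would hold its own semaphore, so the effective limit is the per-process limit multiplied by the number of processes.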