Mock Interview Script
This is a complete mock interview covering a 45-minute technical screen for a junior LLM engineer role. Run it in pairs: one person reads the interviewer questions, the other answers. After each answer, the interviewer reads the feedback note before moving on.
Setup instructions
Interviewer: Read each question exactly as written. Give the candidate 2–3 minutes per question. Use the feedback note to calibrate — don't read it aloud until the candidate has finished answering.
Candidate: Answer as if this is a real interview. No notes. Speak your reasoning out loud. If you don't know something, say what you do know and where your knowledge ends.
Time: Allow 45 minutes total. Use the time markers as guides.
Part 1: Introduction (5 minutes)
Q1 (Interviewer): "Tell me about yourself and what drew you to LLM engineering."
What to listen for
A confident answer covering: background, what they've built (not just studied), and a specific thing about LLMs that interests them technically. Generic answers ("I'm passionate about AI") are weak. Strong: "I built a RAG pipeline that..., and I got interested in how retrieval quality affects answer faithfulness."
Q2 (Interviewer): "Walk me through the most complex LLM project you've built. I want to understand the architecture, the challenges, and what you'd do differently."
What to listen for
Specificity: model names, library versions, how they measured success. Does the candidate understand their own design decisions? Do they have an honest answer about what didn't work? A candidate who says "everything worked great" has either never shipped or isn't self-aware.
Part 2: Technical depth (25 minutes)
Q3 (Interviewer): "You're building a RAG service. The user asks: 'What is our refund policy?' Your retrieval returns three chunks about refund policy — but two are from an old version of the policy that's been superseded. How do you handle this?"
What to listen for
The candidate should identify: (1) the metadata problem — you need document versioning and filtering; (2) the solution — store version or last_updated as ChromaDB metadata and filter by date; (3) the deeper issue — retrieval quality depends on index freshness. A great answer also mentions: monitoring for staleness, re-indexing on document update, and potentially using the document date in the prompt to let the model reason about which chunk is current.
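The filtering idea can be sketched without tying it to a particular vector store. The chunk shape (`doc_id`, `last_updated`, `text`), the sample policy text, and the helper name `keep_latest_versions` are assumptions for illustration, not ChromaDB API. In a real service the same condition would usually be pushed into the store's metadata filter at query time.

```python
from datetime import date


def keep_latest_versions(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose document has a newer version among the results.

    Each chunk is assumed to carry 'doc_id' and 'last_updated' metadata.
    """
    newest: dict[str, date] = {}
    for chunk in chunks:
        doc_id = chunk["doc_id"]
        if doc_id not in newest or chunk["last_updated"] > newest[doc_id]:
            newest[doc_id] = chunk["last_updated"]
    # Keep only chunks that belong to the newest version of their document.
    return [c for c in chunks if c["last_updated"] == newest[c["doc_id"]]]


# Made-up retrieval results: two stale chunks and one current chunk.
retrieved = [
    {"doc_id": "refund-policy", "last_updated": date(2022, 1, 10), "text": "Refunds within 14 days."},
    {"doc_id": "refund-policy", "last_updated": date(2022, 1, 10), "text": "Store credit only."},
    {"doc_id": "refund-policy", "last_updated": date(2024, 3, 1), "text": "Refunds within 30 days."},
]
current = keep_latest_versions(retrieved)
```

Post-filtering like this only helps when the current version made it into the top results at all, which is one reason to prefer filtering at query time and re-indexing on every document update.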
Q4 (Interviewer): "Explain what happens at the code level when you make a streaming API call to OpenAI, and how SSE works end-to-end from the server to the browser."
What to listen for
Level of detail: the candidate should cover (1) stream=True in the API call; (2) the generator pattern, async for chunk in response; (3) the SSE wire format, data: {json}\n\n; (4) StreamingResponse in FastAPI with the text/event-stream media type; (5) the X-Accel-Buffering: no response header, which stops nginx from buffering the stream; (6) the client reading the stream with EventSource in the browser or iter_lines() in a Python HTTP client. Missing the proxy buffering issue is a common gap.
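The wire format in point (3) is easy to get subtly wrong, so a small stdlib-only sketch may help when calibrating answers. The `token` field name, the helper names `to_sse` and `read_sse`, and the `[DONE]` end marker are assumptions chosen for this example; SSE itself only defines the `data:` line and the blank-line separator.

```python
import json
from typing import Iterable, Iterator


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: wrap each token as one SSE event (a data line plus a blank line)."""
    for token in tokens:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # End-of-stream marker; a convention assumed here, not part of SSE itself.
    yield "data: [DONE]\n\n"


def read_sse(lines: Iterable[str]) -> Iterator[str]:
    """Client side: recover tokens from the 'data:' lines of the stream."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        yield json.loads(payload)["token"]


wire = "".join(to_sse(["Hel", "lo", "!"]))
received = "".join(read_sse(wire.splitlines()))
```

In a FastAPI service, a generator like `to_sse` is the kind of object that would be handed to StreamingResponse, and `read_sse` mirrors what a Python client does over the lines it iterates.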
Q5 (Interviewer): "I'm seeing that our LLM service's P95 latency is 6 seconds, but P50 is 800ms. What's causing this, and what would you do to diagnose and fix it?"
What to listen for
Diagnosis: a large gap between P95 and P50 points to occasional slow requests rather than a uniformly slow service. Likely causes: long prompts, slow retrieval for certain queries, LLM API rate limiting causing queue buildup, and cold starts on serverless. The candidate should want to look at the latency breakdown by component (retrieval vs LLM), the distribution of prompt token counts, and whether the P95 requests are retries. Fixes: a prompt length cap, cache warming, min_containers=1 for serverless, and async retry logic.
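The "latency breakdown by component" step might look roughly like the sketch below. The timings are invented, and the field names `retrieval_ms` and `llm_ms` and both helper names are assumptions. The idea is to isolate the tail requests and see which component dominates them.

```python
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1 to 99) using the 'inclusive' method."""
    return statistics.quantiles(samples, n=100, method="inclusive")[pct - 1]


def tail_breakdown(requests: list[dict], pct: int = 95) -> dict[str, float]:
    """Mean time per component among requests at or above the pct-th percentile of total latency."""
    totals = [sum(r.values()) for r in requests]
    cutoff = percentile(totals, pct)
    tail = [r for r, total in zip(requests, totals) if total >= cutoff]
    return {name: statistics.mean(r[name] for r in tail) for name in requests[0]}


# Hypothetical per-request timings in milliseconds.
requests = (
    [{"retrieval_ms": 50, "llm_ms": 700}] * 18
    + [{"retrieval_ms": 60, "llm_ms": 900}, {"retrieval_ms": 4000, "llm_ms": 800}]
)
breakdown = tail_breakdown(requests)
```

With this made-up data the tail is dominated by retrieval rather than the LLM call, which would steer the fix toward the retrieval path instead of the prompt.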
Q6 (Interviewer): "When would you choose to fine-tune a model versus use RAG? Give me a scenario where each is clearly the right choice."
What to listen for
RAG scenario: a customer support bot for a company with a documentation base that changes weekly. Fine-tuning scenario: a company that needs all LLM outputs to follow a specific JSON schema, uses the same format 10,000 times per day, and has labeled training data. The candidate should know the two key questions: does the knowledge change? Do I have labeled data? A strong answer also mentions: RAG is better for knowledge, fine-tuning is better for behavior/style.
Q7 (Interviewer): "Write me a Python function that makes a call to the OpenAI chat API with automatic retry logic for rate limits. Use exponential backoff."
Give the candidate a whiteboard or text editor.
# Expected answer (or close to it):
import asyncio
import random

from openai import AsyncOpenAI, RateLimitError

aclient = AsyncOpenAI()


async def chat_with_retry(messages: list[dict], max_retries: int = 3) -> str:
    base_delay = 1.0
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            response = await aclient.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                temperature=0.0,
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Out of retries: surface the error to the caller.
            if attempt == max_retries:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
What to listen for
(1) Uses AsyncOpenAI, not OpenAI; (2) catches RateLimitError specifically, not bare Exception; (3) re-raises on the last attempt; (4) adds jitter to prevent thundering herd; (5) uses await asyncio.sleep, not time.sleep (which would block the event loop). Bonus signal: the candidate points out that the OpenAI Python SDK already retries some failures on its own, configurable through the client's max_retries option.
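A useful follow-up is to ask how long the function can wait in the worst case. The sketch below reproduces only the delay arithmetic from the expected answer; the helper name `backoff_delays` and the injectable random generator are assumptions added so the schedule can be inspected without calling any API.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int = 3,
    base_delay: float = 1.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Delays slept before each retry: base_delay * 2**attempt plus up to 1s of jitter."""
    rng = rng or random.Random()
    return [
        base_delay * (2 ** attempt) + rng.uniform(0, 1)
        for attempt in range(max_retries)
    ]


delays = backoff_delays(rng=random.Random(0))
worst_case_total = sum(1.0 * (2 ** attempt) + 1 for attempt in range(3))
```

With the defaults, the sleeps fall in the ranges 1 to 2, 2 to 3, and 4 to 5 seconds, so a request that exhausts its retries spends up to 10 seconds waiting. That ties this question back to the tail-latency discussion in Q5.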
Part 3: System design (10 minutes)
Q8 (Interviewer): "Design the backend for a document Q&A product. Users upload PDFs and can ask questions about them. There are 1,000 users, each with 10–100 PDFs, each PDF is 5–50 pages."
Give the candidate a whiteboard or text editor. Expect them to spend 1–2 minutes asking clarifying questions.
What to listen for
Clarifying questions: concurrent users? latency requirement? multi-user or single-user isolation? Key design decisions: per-user collection isolation in ChromaDB/Pinecone (security boundary); chunking strategy for PDFs (page-level vs paragraph-level); async ingestion (don't block the upload endpoint on embedding); streaming for Q&A. Failure modes: what if the PDF is image-only? (OCR or graceful error). The candidate should be able to sketch: upload API → async ingestion worker → vector store → Q&A API → RAG pipeline.
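Strong candidates often do a quick back-of-envelope sizing before sketching boxes. The version below uses the numbers from the question; the chunks-per-page figure, the embedding dimension, and float32 storage are assumptions for illustration and would change with the chunking strategy and embedding model.

```python
USERS = 1_000
PDFS_PER_USER = (10, 100)   # from the question
PAGES_PER_PDF = (5, 50)     # from the question

# Assumptions for illustration only, not given in the question:
CHUNKS_PER_PAGE = 3         # e.g. paragraph-level chunking
EMBEDDING_DIM = 1536        # varies by embedding model
BYTES_PER_FLOAT = 4         # float32


def estimate(pdfs: int, pages: int) -> dict[str, float]:
    """Rough corpus size for one combination of PDFs per user and pages per PDF."""
    total_pages = USERS * pdfs * pages
    chunks = total_pages * CHUNKS_PER_PAGE
    vector_gb = chunks * EMBEDDING_DIM * BYTES_PER_FLOAT / 1e9
    return {"pages": total_pages, "chunks": chunks, "vector_gb": vector_gb}


low = estimate(PDFS_PER_USER[0], PAGES_PER_PDF[0])
high = estimate(PDFS_PER_USER[1], PAGES_PER_PDF[1])
```

Under these assumptions the corpus ranges from 50,000 to 5,000,000 pages, or roughly 0.9 GB to 92 GB of raw vectors. A spread of two orders of magnitude is a good reason to ask clarifying questions before committing to a storage design.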
Part 4: Behavioral (5 minutes)
Q9 (Interviewer): "Tell me about a time an LLM gave you an unexpected output that caused a problem. What did you learn?"
What to listen for
Honesty and specificity. "LLMs sometimes hallucinate" is not an answer. A strong answer: "I had a function-calling pipeline where the model occasionally returned JSON with a key in a different case (Total_Amount vs total_amount) which broke my Pydantic validation. I learned to normalize extracted keys and add strict schema validation with error logging."
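The lesson in that strong answer can be shown in a few lines. This is a stdlib-only sketch rather than Pydantic code. The required keys, the sample payload, and the function name `parse_model_json` are hypothetical.

```python
import json
import logging

logger = logging.getLogger(__name__)

REQUIRED_KEYS = {"total_amount", "currency"}  # hypothetical schema


def parse_model_json(raw: str) -> dict:
    """Parse model output, lowercase top-level keys, and reject unexpected shapes."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    normalized = {key.lower(): value for key, value in data.items()}
    if set(normalized) != REQUIRED_KEYS:
        # Log before raising so schema drift shows up in monitoring.
        logger.error("schema mismatch: got keys %s", sorted(normalized))
        raise ValueError(f"unexpected keys: {sorted(normalized)}")
    return normalized


result = parse_model_json('{"Total_Amount": 42.5, "currency": "EUR"}')
```

Normalizing first and validating second keeps harmless casing differences from failing requests, while still rejecting and logging output that is structurally wrong.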
Q10 (Interviewer): "If you had two weeks to improve a RAG system's answer quality from 70% to 90% on a test set, what would you do, in what order?"
What to listen for
The answer should prioritize high-leverage changes first: (1) analyze failure cases: are they retrieval failures or generation failures? (2) fix retrieval first, with better chunking and a reranker; (3) fix generation, with a better prompt, more context, and lower temperature; (4) re-run the full test set after each change to confirm the gain, caching results only for pipeline stages that did not change. A good answer establishes the diagnosis step before prescribing solutions.
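The diagnosis step can be made concrete with a small triage over the test set. The record fields (`gold_doc_id`, `retrieved_doc_ids`, `answer_correct`) and the sample rows are assumptions. The sketch presumes each test question has a known source document and a graded answer.

```python
from collections import Counter


def classify_failure(example: dict) -> str:
    """Label one eval example as 'ok', 'retrieval_failure', or 'generation_failure'."""
    if example["answer_correct"]:
        return "ok"
    # The gold document never reached the prompt, so the generator had no chance.
    if example["gold_doc_id"] not in example["retrieved_doc_ids"]:
        return "retrieval_failure"
    # The right context was present but the answer was still wrong.
    return "generation_failure"


# Made-up evaluation records.
eval_results = [
    {"gold_doc_id": "a", "retrieved_doc_ids": ["a", "b"], "answer_correct": True},
    {"gold_doc_id": "c", "retrieved_doc_ids": ["a", "b"], "answer_correct": False},
    {"gold_doc_id": "d", "retrieved_doc_ids": ["d", "e"], "answer_correct": False},
    {"gold_doc_id": "f", "retrieved_doc_ids": ["g"], "answer_correct": False},
]
breakdown = Counter(classify_failure(e) for e in eval_results)
```

If most failures land in the retrieval bucket, as in this invented sample, chunking and reranking come first. If they land in the generation bucket, prompt work is the better use of the two weeks.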
Debrief template
After the mock, give the candidate feedback on each dimension:
| Dimension | Observation | Suggestion |
|---|---|---|
| Technical accuracy | ||
| Specificity (named real tools/models) | ||
| Tradeoff reasoning | ||
| Communication clarity | ||
| Handling of unknown questions | ||
| Coding exercise | ||
Overall: What is the single most impactful thing this candidate could do to improve their interview performance in the next two weeks?