LangSmith¶
LangSmith is LangChain's observability platform for LLM applications. It captures traces automatically when you use LangChain components, lets you inspect every prompt and response, build evaluation datasets from production traces, and run automated evaluations. You can also use it with non-LangChain code via its Python SDK.
Learning objectives¶
- Enable LangSmith tracing with environment variables
- Use the @traceable decorator for custom functions
- Build evaluation datasets from production traces
- Run LLM-as-judge evaluations on a dataset
- Interpret the LangSmith dashboard
Setup¶
import os
# Set these before importing langchain/langgraph
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY", "")
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app" # Groups traces in the UI
# All LangChain/LangGraph calls are now automatically traced
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
response = llm.invoke([HumanMessage(content="What is RAG?")])
# → The trace appears in the LangSmith UI under the "my-llm-app" project shortly after the call returns
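The setup above falls back to an empty API key when LANGSMITH_API_KEY is unset, which is easy to miss. A minimal sketch of a pre-flight check follows. The helper name `missing_tracing_settings` is hypothetical, the variable names are the ones used in the setup snippet, and the function only inspects a mapping without contacting LangSmith.

```python
import os
from typing import Mapping

# Variable names taken from the setup snippet above.
REQUIRED_TRACING_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_tracing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the tracing variables that are unset or blank (hypothetical helper)."""
    return [name for name in REQUIRED_TRACING_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_tracing_settings()
    if missing:
        print("Tracing is not fully configured; missing:", ", ".join(missing))
```

Calling this once at startup, before the first LLM call, is usually enough to catch a missing key early.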
Tracing custom (non-LangChain) code¶
import os
from langsmith import traceable
from openai import OpenAI
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

@traceable(name="classify_support_ticket", tags=["support", "classification"])
def classify_ticket(ticket_text: str) -> str:
    """Classify a support ticket. Traced automatically by LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify as: billing, technical, account, or shipping."},
            {"role": "user", "content": ticket_text},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

@traceable(name="full_support_pipeline")
def handle_ticket(ticket: str) -> dict:
    """Multi-step pipeline: each step is traced as a child span."""
    category = classify_ticket(ticket)
    response = generate_response(ticket, category)
    return {"category": category, "response": response}

@traceable(name="generate_response")
def generate_response(ticket: str, category: str) -> str:
    """Draft a reply for the ticket, given its category."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You handle {category} support issues. Be concise."},
            {"role": "user", "content": ticket},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

# Test — all calls appear in LangSmith as a nested trace
result = handle_ticket("I was charged twice for my subscription this month!")
print(f"Category: {result['category']}")
print(f"Response: {result['response'][:200]}")
Building evaluation datasets¶
import os
from langsmith import Client
ls_client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))
# Create a dataset from hand-labeled examples
dataset = ls_client.create_dataset(
    "support-ticket-classification-v1",
    description="Labeled support tickets for classifier evaluation",
)
examples = [
    {
        "inputs": {"ticket": "My credit card was charged twice this month."},
        "outputs": {"category": "billing"},
    },
    {
        "inputs": {"ticket": "The app crashes every time I try to upload a file."},
        "outputs": {"category": "technical"},
    },
    {
        "inputs": {"ticket": "I can't log into my account after the password reset."},
        "outputs": {"category": "account"},
    },
    {
        "inputs": {"ticket": "My order was supposed to arrive yesterday but hasn't shown up."},
        "outputs": {"category": "shipping"},
    },
]
# Upload inputs and expected outputs as parallel lists
ls_client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
print(f"Dataset created: {dataset.id} with {len(examples)} examples")
Running evaluations¶
import os
from langsmith import Client, traceable
from langsmith.evaluation import evaluate
from openai import OpenAI

ls_client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# The function to evaluate: receives one example's inputs, returns its outputs
@traceable
def classify_ticket_v2(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with exactly one word: billing, technical, account, or shipping."},
            {"role": "user", "content": inputs["ticket"]},
        ],
        temperature=0.0,
    )
    return {"category": response.choices[0].message.content.strip().lower()}

# Exact match evaluator: compares the predicted label with the dataset label
def exact_match(run, example) -> dict:
    predicted = run.outputs.get("category", "").strip().lower()
    expected = example.outputs.get("category", "").strip().lower()
    return {"key": "exact_match", "score": 1 if predicted == expected else 0}

# Run evaluation
results = evaluate(
    classify_ticket_v2,
    data="support-ticket-classification-v1",  # Dataset name
    evaluators=[exact_match],
    experiment_prefix="classifier-v2",
)
print("Evaluation complete. Results visible in LangSmith UI.")
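Exact matching is strict: a reply such as "Billing." scores 0 even though the label is right. One possible variation, sketched below, normalizes the prediction before comparing. The names `normalize_category` and `normalized_match` are hypothetical, and `SimpleNamespace` only stands in for the run and example objects so the snippet runs without calling LangSmith.

```python
import string
from types import SimpleNamespace

# Labels used by the classifier prompt above.
ALLOWED_CATEGORIES = {"billing", "technical", "account", "shipping"}


def normalize_category(text: str) -> str:
    """Lowercase, trim whitespace and surrounding punctuation; unknown labels become 'unknown'."""
    cleaned = (text or "").strip().lower().strip(string.punctuation + string.whitespace)
    return cleaned if cleaned in ALLOWED_CATEGORIES else "unknown"


def normalized_match(run, example) -> dict:
    """Evaluator with the same (run, example) shape as exact_match above."""
    predicted = normalize_category(run.outputs.get("category", ""))
    expected = normalize_category(example.outputs.get("category", ""))
    score = int(predicted == expected and predicted != "unknown")
    return {"key": "normalized_match", "score": score}


# Stand-ins for the objects an evaluator receives (assumption: only .outputs is needed).
fake_run = SimpleNamespace(outputs={"category": " Billing.\n"})
fake_example = SimpleNamespace(outputs={"category": "billing"})
print(normalized_match(fake_run, fake_example))  # {'key': 'normalized_match', 'score': 1}
```

Reporting both scores side by side can show how many failures are formatting problems rather than misclassifications.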
Key LangSmith concepts¶
| Concept | Description |
|---|---|
| Run | A single traced operation (one LLM call, one chain invocation) |
| Trace | A tree of runs — the full execution of a request |
| Dataset | Collection of input/output examples for evaluation |
| Experiment | Running a function against a dataset; stores results for comparison |
| Evaluator | Function that scores an experiment result |
| Feedback | Human or automated rating attached to a run |
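To make the Run and Trace rows concrete, here is a toy model of a trace as a tree of runs. `ToyRun` and `flatten` are illustrative names only and do not reflect LangSmith's actual run schema. The run names are the ones from the support pipeline example earlier on this page.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    """Illustrative stand-in for a run: a name plus child runs."""
    name: str
    children: list["ToyRun"] = field(default_factory=list)


def flatten(run: ToyRun, depth: int = 0) -> list[tuple[int, str]]:
    """Walk the tree depth-first and return (depth, name) pairs."""
    rows = [(depth, run.name)]
    for child in run.children:
        rows.extend(flatten(child, depth + 1))
    return rows


# The trace is the root run together with everything beneath it.
trace = ToyRun(
    "full_support_pipeline",
    [ToyRun("classify_support_ticket"), ToyRun("generate_response")],
)

for depth, name in flatten(trace):
    print("  " * depth + name)
```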
Use the LangSmith trace explorer to debug production failures
When a user reports a bad output, search by session_id or trace_id in LangSmith to see the exact prompt, model response, token counts, and all child spans. No more asking users "what did you type exactly?"
LangSmith stores your prompts and responses
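This is a data privacy consideration. Review your organization's data handling policy before enabling LangSmith tracing in production. To keep PII out of traces, redact or hide payloads before they are sent, for example with the client's hide_inputs and hide_outputs options or the process_inputs and process_outputs arguments of @traceable.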
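A redaction step is usually a plain function that takes a payload and returns a sanitized copy. The sketch below assumes email addresses are the only PII of interest and uses a hypothetical helper name, `redact_emails`. How such a function is wired into the tracing client depends on your SDK version, so check the SDK reference before relying on it.

```python
import re

# Deliberately simple pattern; real PII detection needs more than one regex.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(value):
    """Recursively replace email addresses inside strings, dicts, and lists."""
    if isinstance(value, str):
        return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {key: redact_emails(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact_emails(item) for item in value]
    return value  # numbers, None, and other types pass through unchanged


sample = {"ticket": "Contact me at jane.doe@example.com please", "attempts": 3}
print(redact_emails(sample))
```

The function returns a new structure instead of mutating its argument, so the application can keep using the original payload.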