LangSmith

LangSmith is LangChain's observability platform for LLM applications. Once tracing is enabled, it captures traces automatically from LangChain components and lets you inspect every prompt and response, build evaluation datasets from production traces, and run automated evaluations. You can also use it with non-LangChain code via its Python SDK.

Learning objectives

Enable LangSmith tracing with environment variables
Use the @traceable decorator for custom functions
Build evaluation datasets from production traces
Run LLM-as-judge evaluations on a dataset
Interpret the LangSmith dashboard

Setup

import os

# Set these before importing langchain/langgraph
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY", "")
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"  # Groups traces in the UI

# All LangChain/LangGraph calls are now automatically traced
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
response = llm.invoke([HumanMessage(content="What is RAG?")])
# → The trace appears in the LangSmith UI under the "my-llm-app" project shortly after the call
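
A missing or empty variable is an easy mistake here, especially since the setup above falls back to an empty string when LANGSMITH_API_KEY is unset. The helper below is a minimal sketch of a pre-flight check. `tracing_config_problems` is a hypothetical name, not part of LangSmith, and it only inspects the tracing flag and API key variables used above.

```python
import os

def tracing_config_problems(env=None):
    """Return a list of human-readable problems with the tracing settings.

    Hypothetical helper for illustration; it only looks at the variables
    set in the setup snippet and does not contact LangSmith.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY"):
        if not env.get(name, "").strip():
            problems.append(f"{name} is missing or empty")
    flag = env.get("LANGCHAIN_TRACING_V2", "").strip().lower()
    if flag and flag != "true":
        # The setup above uses the literal string "true".
        problems.append("LANGCHAIN_TRACING_V2 is not 'true'")
    return problems

# Report anything suspicious before making the first traced call.
for problem in tracing_config_problems():
    print("Tracing config:", problem)
```

Passing a plain dict instead of relying on `os.environ` makes the check easy to exercise in a unit test.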

Tracing custom (non-LangChain) code

import os
from langsmith import traceable
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

@traceable(name="classify_support_ticket", tags=["support", "classification"])
def classify_ticket(ticket_text: str) -> str:
    """Classify a support ticket. Traced automatically by LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify as: billing, technical, account, or shipping."},
            {"role": "user", "content": ticket_text}
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

@traceable(name="full_support_pipeline")
def handle_ticket(ticket: str) -> dict:
    """Multi-step pipeline — each step traced as a child span."""
    category = classify_ticket(ticket)
    response = generate_response(ticket, category)
    return {"category": category, "response": response}

@traceable(name="generate_response")
def generate_response(ticket: str, category: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"You handle {category} support issues. Be concise."},
            {"role": "user", "content": ticket}
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

# Test — all calls appear in LangSmith as a nested trace
result = handle_ticket("I was charged twice for my subscription this month!")
print(f"Category: {result['category']}")
print(f"Response: {result['response'][:200]}")
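
To make the "nested trace" idea concrete, the following toy decorator shows one way a tracing decorator can attach each call to whichever call is currently running. It is a sketch for intuition only, not LangSmith's implementation; `toy_traceable` and `last_trace` are hypothetical names, and nothing is sent anywhere.

```python
import contextvars
import functools

# The span currently executing, or None at the top level.
_current_span = contextvars.ContextVar("current_span", default=None)

def toy_traceable(name):
    """Record each call as a span nested under whichever span is active."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {"name": name, "children": []}
            parent = _current_span.get()
            if parent is not None:
                parent["children"].append(span)  # become a child span
            token = _current_span.set(span)
            try:
                span["output"] = func(*args, **kwargs)
                return span["output"]
            finally:
                _current_span.reset(token)
                if parent is None:
                    wrapper.last_trace = span  # keep the finished root span
        wrapper.last_trace = None
        return wrapper
    return decorator

@toy_traceable("classify")
def classify(text):
    return "billing" if "charged" in text else "technical"

@toy_traceable("pipeline")
def pipeline(text):
    return {"category": classify(text)}

pipeline("I was charged twice")
trace = pipeline.last_trace
print(trace["name"], [child["name"] for child in trace["children"]])
# → pipeline ['classify']
```

The same parent/child relationship is what the LangSmith UI displays when `handle_ticket` calls `classify_ticket` and `generate_response`.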

Building evaluation datasets

import os
from langsmith import Client

ls_client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))

# Create a dataset from hand-labeled examples
dataset = ls_client.create_dataset(
    "support-ticket-classification-v1",
    description="Labeled support tickets for classifier evaluation"
)

examples = [
    {
        "inputs": {"ticket": "My credit card was charged twice this month."},
        "outputs": {"category": "billing"}
    },
    {
        "inputs": {"ticket": "The app crashes every time I try to upload a file."},
        "outputs": {"category": "technical"}
    },
    {
        "inputs": {"ticket": "I can't log into my account after the password reset."},
        "outputs": {"category": "account"}
    },
    {
        "inputs": {"ticket": "My order was supposed to arrive yesterday but hasn't shown up."},
        "outputs": {"category": "shipping"}
    },
]

ls_client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
print(f"Dataset created: {dataset.id} with {len(examples)} examples")
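
The snippet above uses hand-labeled examples. When the starting point is production traffic instead, the records usually need filtering first. The sketch below assumes each record is a plain dict with `inputs`, `outputs`, and an optional `error` key; that shape is an assumption for illustration. Records fetched through the SDK (for example with `Client.list_runs`) are objects with attributes rather than dicts and would need adapting. `runs_to_examples` is a hypothetical helper.

```python
def runs_to_examples(runs, input_key="ticket", output_key="category"):
    """Turn run-like dicts into parallel inputs/outputs lists.

    Skips failed runs, runs with missing fields, and duplicate inputs.
    """
    inputs, outputs, seen = [], [], set()
    for run in runs:
        ticket = (run.get("inputs") or {}).get(input_key)
        label = (run.get("outputs") or {}).get(output_key)
        if not ticket or not label or run.get("error"):
            continue  # incomplete or failed run
        if ticket in seen:
            continue  # keep only the first occurrence of each input
        seen.add(ticket)
        inputs.append({input_key: ticket})
        outputs.append({output_key: str(label).strip().lower()})
    return inputs, outputs

# Illustrative records, not real production data.
sample_runs = [
    {"inputs": {"ticket": "I was charged twice."}, "outputs": {"category": "Billing"}},
    {"inputs": {"ticket": "I was charged twice."}, "outputs": {"category": "billing"}},
    {"inputs": {"ticket": "App crashes on upload."}, "outputs": None, "error": "timeout"},
]
inputs, outputs = runs_to_examples(sample_runs)
print(inputs, outputs)
```

The two lists match the `inputs=` and `outputs=` arguments passed to `create_examples` above. Model outputs are not ground truth, so labels harvested this way should be reviewed by a person before they are used as references.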

Running evaluations

import os
from langsmith import Client, traceable  # traceable is needed for the decorator below
from langsmith.evaluation import evaluate

ls_client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))

# The function to evaluate
@traceable
def classify_ticket_v2(inputs: dict) -> dict:
    from openai import OpenAI
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with exactly one word: billing, technical, account, or shipping."},
            {"role": "user", "content": inputs["ticket"]}
        ],
        temperature=0.0,
    )
    return {"category": response.choices[0].message.content.strip().lower()}

# Exact match evaluator
def exact_match(run, example) -> dict:
    predicted = run.outputs.get("category", "").strip().lower()
    expected = example.outputs.get("category", "").strip().lower()
    return {"key": "exact_match", "score": 1 if predicted == expected else 0}

# Run evaluation
results = evaluate(
    classify_ticket_v2,
    data="support-ticket-classification-v1",  # Dataset name
    evaluators=[exact_match],
    experiment_prefix="classifier-v2",
)
print("Evaluation complete. Results visible in LangSmith UI.")
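
Exact match suits single-word labels. For free-text outputs, such as the replies from `generate_response`, an LLM-as-judge evaluator is the usual alternative. The sketch below keeps the same `(run, example)` signature and return shape as `exact_match`, and takes the judge as a plain callable so it can be tried without a model. `make_judge_evaluator`, `JUDGE_PROMPT`, and `fake_judge` are hypothetical names, and the prompt wording is only an example.

```python
from types import SimpleNamespace

JUDGE_PROMPT = (
    "Ticket: {ticket}\n"
    "Predicted category: {predicted}\n"
    "Reference category: {expected}\n"
    "Answer CORRECT or INCORRECT."
)

def make_judge_evaluator(judge):
    """Build an evaluator that asks `judge` (a str -> str callable) for a verdict."""
    def llm_judge(run, example) -> dict:
        prompt = JUDGE_PROMPT.format(
            ticket=example.inputs.get("ticket", ""),
            predicted=run.outputs.get("category", ""),
            expected=example.outputs.get("category", ""),
        )
        verdict = judge(prompt).strip().upper()
        # "INCORRECT" does not start with "CORRECT", so this check is safe.
        return {"key": "llm_judge", "score": 1 if verdict.startswith("CORRECT") else 0}
    return llm_judge

# Stand-in judge for a dry run; swap in a real model call in practice.
def fake_judge(prompt: str) -> str:
    return "CORRECT" if "Predicted category: billing" in prompt else "INCORRECT"

evaluator = make_judge_evaluator(fake_judge)
run = SimpleNamespace(outputs={"category": "billing"})
example = SimpleNamespace(inputs={"ticket": "Charged twice."}, outputs={"category": "billing"})
print(evaluator(run, example))
```

An evaluator built this way could be added to the `evaluators` list alongside `exact_match`, with `judge` wrapping a chat completion call. Treat that wiring as an assumption to verify against the SDK version in use.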

Key LangSmith concepts

Concept	Description
Run	A single traced operation (one LLM call, one chain invocation)
Trace	A tree of runs — the full execution of a request
Dataset	Collection of input/output examples for evaluation
Experiment	Running a function against a dataset; stores results for comparison
Evaluator	Function that scores an experiment result
Feedback	Human or automated rating attached to a run

Use the LangSmith trace explorer to debug production failures

When a user reports a bad output, search by session_id or trace_id in LangSmith to see the exact prompt, model response, token counts, and all child spans. No more asking users "what did you type exactly?"

LangSmith stores your prompts and responses

This is a data privacy consideration. Review your organization's data handling policy before enabling LangSmith tracing in production. To redact PII from traces, use the hide_inputs and hide_outputs options of the LangSmith Client, or process_inputs and process_outputs on @traceable.
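
As an illustration of what a redaction step might look like, the function below masks email addresses and card-like digit runs anywhere inside a nested inputs or outputs dict. It is a minimal sketch: the two regular expressions are illustrative and will miss many forms of PII, and `redact` is a hypothetical helper. How such a function is attached to tracing depends on the SDK, so check the hook name and signature for the version in use.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators

def redact(value):
    """Recursively mask emails and card-like numbers in strings, dicts, and lists."""
    if isinstance(value, str):
        value = EMAIL_RE.sub("[EMAIL]", value)
        return CARD_RE.sub("[CARD]", value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, None, and other types pass through unchanged

inputs = {"ticket": "Card 4111 1111 1111 1111 was charged twice, reach me at jane.doe@example.com"}
print(redact(inputs))
# → {'ticket': 'Card [CARD] was charged twice, reach me at [EMAIL]'}
```

Because the function returns a new structure instead of mutating its argument, the application can keep using the original values while only the redacted copy is recorded.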
