Inference API
The Hugging Face Inference API lets you run models hosted on the Hub without managing any GPU infrastructure. You send a POST request with your text; you get back the model's output. For prototyping and low-volume production use, it's faster to set up than a self-hosted GPU server.
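As a rough illustration of the request shape described above, the sketch below builds (but does not send) such a POST request with the standard library only. The URL pattern, the `inputs` payload key, and the `build_request` helper are assumptions made for illustration; check the current API documentation before relying on them.

```python
import json
import os
import urllib.request

# Assumed URL pattern for the serverless API; the real route may differ.
API_ROOT = "https://api-inference.huggingface.co/models"


def build_request(model: str, text: str, token: str) -> urllib.request.Request:
    """Build (without sending) a POST request that carries the input text as JSON."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_ROOT}/{model}",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    model="distilbert-base-uncased-finetuned-sst-2-english",
    text="I love this library.",
    token=os.getenv("HF_TOKEN", "hf_placeholder"),  # placeholder, not a real token
)
print(request.get_method(), request.full_url)
# Sending it would look like: urllib.request.urlopen(request).read()
```

In practice the SDK described in the next section hides these details, so this is mainly useful for seeing what travels over the wire.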
Learning objectives
- Call the Inference API using the InferenceClient Python SDK
- Handle rate limits and retries correctly
- Use the serverless API for common tasks: text generation, embeddings, classification, image generation
- Understand when to use Inference Endpoints vs the serverless API
InferenceClient: the Python SDK
The huggingface_hub package includes InferenceClient, which handles authentication, retries, and response parsing.
import os
from huggingface_hub import InferenceClient
client = InferenceClient(token=os.getenv("HF_TOKEN"))
# Text generation
response = client.text_generation(
    prompt="Explain gradient descent in one paragraph, as if to a software engineer.",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
)
print(response)
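One thing to watch in the snippet above: `os.getenv("HF_TOKEN")` returns `None` when the variable is unset, so a missing token can go unnoticed until a request fails. A minimal guard might look like the following; `require_token` and the `HF_TOKEN_DEMO` variable are hypothetical names used only for this sketch.

```python
import os


def require_token(var_name: str = "HF_TOKEN") -> str:
    """Return the token stored in the given environment variable, or fail loudly."""
    token = os.getenv(var_name, "").strip()
    if not token:
        raise RuntimeError(f"{var_name} is not set. Export it before creating the client.")
    return token


# Demonstration with a placeholder value (not a real token).
os.environ["HF_TOKEN_DEMO"] = "hf_placeholder"
print(require_token("HF_TOKEN_DEMO"))
```

The client could then be created with `InferenceClient(token=require_token())`.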
Chat completions (OpenAI-compatible endpoint)
Many models on the Hub expose an OpenAI-compatible /v1/chat/completions endpoint. This means you can use the OpenAI Python client with a custom base_url.
import os
from openai import OpenAI
# Use HF Inference API with the OpenAI client
hf_client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=os.getenv("HF_TOKEN"),
)
response = hf_client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is the difference between LSTM and Transformer?"},
    ],
    max_tokens=300,
    temperature=0.5,
)
print(response.choices[0].message.content)
OpenAI-compatible models on HF
Any model with "Inference Endpoints" enabled supports the /v1/chat/completions route. This lets you swap between OpenAI and open-source models by changing base_url and model — no other code changes needed.
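One way to make that swap explicit is to keep the per-backend settings in a small table and derive the client arguments from it. This is only a sketch: `Backend`, `BACKENDS`, and `client_kwargs` are hypothetical names, and the OpenAI model name is a placeholder rather than a recommendation.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backend:
    base_url: Optional[str]  # None means "use the client's default URL"
    model: str
    token_env: str           # environment variable that holds the API key


BACKENDS = {
    "hf": Backend(
        base_url="https://api-inference.huggingface.co/v1/",
        model="mistralai/Mistral-7B-Instruct-v0.3",
        token_env="HF_TOKEN",
    ),
    # Placeholder model name for illustration only.
    "openai": Backend(base_url=None, model="gpt-4o-mini", token_env="OPENAI_API_KEY"),
}


def client_kwargs(name: str) -> dict:
    """Keyword arguments for OpenAI(...); only base_url and api_key vary per backend."""
    backend = BACKENDS[name]
    kwargs = {"api_key": os.getenv(backend.token_env)}
    if backend.base_url is not None:
        kwargs["base_url"] = backend.base_url
    return kwargs


for name, backend in BACKENDS.items():
    print(name, backend.model, client_kwargs(name).get("base_url"))
```

With this in place, `OpenAI(**client_kwargs("hf"))` and `model=BACKENDS["hf"].model` would select the Hugging Face backend, and switching the key to `"openai"` would select the other.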
Embeddings via Inference API
import os
import numpy as np
from huggingface_hub import InferenceClient
client = InferenceClient(token=os.getenv("HF_TOKEN"))
def embed_texts(texts: list[str], model: str = "BAAI/bge-small-en-v1.5") -> np.ndarray:
    """Batch embed texts using the Inference API."""
    embeddings = client.feature_extraction(
        text=texts,
        model=model,
        normalize=True,
    )
    return np.array(embeddings)

# Semantic similarity
texts = [
    "How do I cancel my subscription?",
    "What is the process to unsubscribe?",
    "Tell me about your pricing plans.",
]
embs = embed_texts(texts)
# Cosine similarity (vectors already normalized)
similarity_matrix = embs @ embs.T
print("Similarity matrix:")
for i, text_i in enumerate(texts):
    for j, text_j in enumerate(texts):
        if i < j:
            print(f" {i}↔{j}: {similarity_matrix[i,j]:.3f} | '{text_i[:30]}...' ↔ '{text_j[:30]}...'")
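The `embs @ embs.T` shortcut above relies on a property worth spelling out: for unit-length vectors, the dot product equals the cosine similarity. The sketch below checks this on toy 2-D vectors that stand in for real embeddings; it uses plain Python so it runs without any API call, and the helper names are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, not real embeddings.
raw = [[3.0, 4.0], [4.0, 3.0], [-4.0, 3.0]]
unit = [normalize(v) for v in raw]

for i in range(len(raw)):
    for j in range(i + 1, len(raw)):
        # Both columns should agree up to floating-point error.
        print(f"{i}-{j}: dot of unit vectors={dot(unit[i], unit[j]):.3f}  cosine={cosine(raw[i], raw[j]):.3f}")
```

If the server does not return normalized vectors, dividing each row by its norm first restores this equivalence.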
Classification and NER
import os
from huggingface_hub import InferenceClient
client = InferenceClient(token=os.getenv("HF_TOKEN"))
# Sentiment analysis
sentiment = client.text_classification(
    text="The new update broke all my workflows. Very frustrated.",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(f"Sentiment: {sentiment[0]['label']} ({sentiment[0]['score']:.3f})")
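# Zero-shot classification (no fine-tuning needed)
# The text and label list are passed positionally because the keyword name
# for the labels has differed between huggingface_hub releases.
result = client.zero_shot_classification(
    "I need to update my payment method.",
    ["billing", "account", "technical support", "general"],
    model="facebook/bart-large-mnli",
)
# The result is a list with one label/score entry per candidate label.
# Sort by score, highest first.
ranked = sorted(result, key=lambda item: item["score"], reverse=True)
for item in ranked:
    print(f" {item['label']:<20} {item['score']:.3f}")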
# Named entity recognition
entities = client.token_classification(
    text="Elon Musk founded SpaceX in Hawthorne, California in 2002.",
    model="dslim/bert-base-NER",
)
for entity in entities:
    if entity["entity_group"] in ("PER", "ORG", "LOC"):
        print(f" [{entity['entity_group']}] {entity['word']}")
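Zero-shot scores are often used to route a request, and a low top score usually means none of the labels fits well. The sketch below shows one possible routing rule, assuming the labels and scores have already been pulled into two parallel lists; `pick_label`, the 0.5 threshold, and the scores are made up for illustration.

```python
def pick_label(labels, scores, min_score=0.5, fallback="general"):
    """Return the top-scoring label, or the fallback if no score is confident enough."""
    best_label, best_score = max(zip(labels, scores), key=lambda pair: pair[1])
    return best_label if best_score >= min_score else fallback


labels = ["billing", "account", "technical support", "general"]

# Made-up scores; real values come from the model.
print(pick_label(labels, [0.81, 0.12, 0.04, 0.03]))  # clear winner
print(pick_label(labels, [0.30, 0.28, 0.22, 0.20]))  # ambiguous, falls back
```

The right threshold depends on the model and the label set, so it is worth tuning on a handful of real examples.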
Streaming text generation
import os
from huggingface_hub import InferenceClient
client = InferenceClient(token=os.getenv("HF_TOKEN"))
# Streaming output: tokens arrive as they're generated
print("Streaming response:")
for token in client.text_generation(
    prompt="List 5 best practices for writing RAG prompts:",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    max_new_tokens=300,
    stream=True,
):
    print(token, end="", flush=True)
print()
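The loop above prints tokens but does not keep them. If the full text is also needed afterwards (for logging or post-processing), the tokens can be collected while they are echoed. In this sketch `collect_stream` and `fake_stream` are hypothetical names; the fake generator stands in for `client.text_generation(..., stream=True)` so the example runs offline.

```python
from typing import Iterable, Iterator


def collect_stream(tokens: Iterable[str], echo: bool = True) -> str:
    """Consume a token stream, optionally echoing it, and return the joined text."""
    parts = []
    for token in tokens:
        if echo:
            print(token, end="", flush=True)
        parts.append(token)
    if echo:
        print()
    return "".join(parts)


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming API response."""
    yield from ["1. ", "Be ", "specific", "."]


full_text = collect_stream(fake_stream())
print(len(full_text), "characters received")
```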
Rate limits and error handling
The free Inference API has rate limits: ~300 requests/hour for most models.
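A limit of roughly 300 requests per hour works out to one request every 12 seconds on average. One way to stay under such a limit is to pace calls on the client side. The class below is a minimal sketch of that idea, not part of `huggingface_hub`; `MinIntervalThrottle` is a hypothetical name, and the clock and sleep functions are injectable so the demonstration runs instantly.

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum gap between calls, derived from an hourly budget."""

    def __init__(self, max_per_hour: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> float:
        """Sleep until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self._last_call + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


# Demonstration with a fake clock so nothing actually sleeps.
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


throttle = MinIntervalThrottle(300, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(throttle.min_interval, delays)
```

In real use, `throttle.wait()` would be called right before each API request, with the default clock and sleep.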
import logging
import os
import time

from huggingface_hub import InferenceClient
from huggingface_hub.utils import HfHubHTTPError

logger = logging.getLogger(__name__)
client = InferenceClient(token=os.getenv("HF_TOKEN"))

def call_with_retry(
    func,
    *args,
    max_retries: int = 3,
    backoff_base: float = 2.0,
    **kwargs
):
    """Call a function, retrying on rate-limit (429) and model-loading (503) errors."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except HfHubHTTPError as e:
            # Prefer the HTTP status code; fall back to the message text
            status = getattr(getattr(e, "response", None), "status_code", None)
            message = str(e).lower()
            if status == 429 or "rate limit" in message:
                # Exponential backoff: 1s, 2s, 4s with the default base
                wait = backoff_base ** attempt
                logger.warning(f"Rate limited. Waiting {wait}s (attempt {attempt+1}/{max_retries})")
            elif status == 503 or "loading" in message:
                # Model is loading (cold start): wait longer
                wait = 20.0
                logger.info(f"Model loading. Waiting {wait}s...")
            else:
                raise
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts")
# Usage
response = call_with_retry(
    client.text_generation,
    prompt="Summarize the key concepts of attention mechanisms.",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    max_new_tokens=150,
)
print(response)
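With the defaults of `call_with_retry` (`backoff_base=2.0`, `max_retries=3`), the rate-limit waits are 1, 2, and 4 seconds. For longer retry budgets it is common to cap the delay and add random jitter so that many clients do not retry in lockstep. The helper below sketches that schedule; `backoff_delays`, the cap, and the jitter fraction are illustrative choices rather than anything prescribed by the API.

```python
import random


def backoff_delays(max_retries=3, backoff_base=2.0, cap=60.0, jitter=0.0, rng=random):
    """Return the wait before each retry: capped exponential plus optional jitter.

    jitter is a fraction of the delay: 0.5 adds up to 50% extra, chosen at random.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, backoff_base ** attempt)
        delay += rng.uniform(0.0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays())                           # same schedule as call_with_retry
print(backoff_delays(max_retries=8))              # later delays hit the 60 s cap
print(backoff_delays(jitter=0.5, rng=random.Random(0)))  # randomized variant
```

Each value could be passed to `time.sleep` in place of the fixed `backoff_base ** attempt`.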
Cold start latency
Free-tier models spin down when idle. The first request after idle may return a 503 with "loading" in the error message. This is normal — retry after 20–30 seconds. Paid Inference Endpoints stay warm and have no cold start.
Inference API vs Inference Endpoints
| Feature | Serverless Inference API | Inference Endpoints |
|---|---|---|
| Setup | None (use immediately) | Deploy via UI or API |
| Cost | Free tier + pay-per-call | Hourly instance cost |
| Cold start | Yes (up to 60s) | No (always warm) |
| Custom models | Public Hub models only | Any Hub model or private |
| Hardware | Shared GPU pool | Dedicated GPU |
| Rate limits | ~300/hour free | None (you own the instance) |
| SLA | Best effort | Uptime SLA available |
Use serverless for: prototyping, development, low-volume applications, testing model quality.
Use Endpoints for: production applications, latency-sensitive use cases, private models, consistent throughput.
# Deploying a dedicated Inference Endpoint programmatically.
# Uncommenting the call below creates a dedicated instance billed by the hour.
from huggingface_hub import create_inference_endpoint

# endpoint = create_inference_endpoint(
#     name="my-production-endpoint",
#     repository="mistralai/Mistral-7B-Instruct-v0.3",
#     framework="pytorch",
#     task="text-generation",
#     accelerator="gpu",
#     vendor="aws",  # required: cloud provider that hosts the instance
#     instance_size="x1",
#     instance_type="nvidia-l4",
#     region="us-east-1",
#     type="protected",  # "public", "protected", or "private"
# )
# endpoint.wait()  # Wait for deployment
# print(f"Endpoint URL: {endpoint.url}")