Deployment: Fine-Tuned Classifier
Docker for the inference server
FROM python:3.11-slim
WORKDIR /app
# CPU-only torch (smaller image)
RUN pip install --no-cache-dir torch==2.4.1 --index-url https://download.pytorch.org/whl/cpu
COPY requirements-inference.txt .
RUN pip install --no-cache-dir -r requirements-inference.txt
COPY app.py .
COPY merged-model/ ./merged-model/
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# requirements-inference.txt (CPU-only, no training deps)
transformers==4.45.0
fastapi==0.115.0
uvicorn==0.30.6
pydantic==2.9.0
python-dotenv==1.0.1
The inference container doesn't need a GPU
After merging, the weights are standard PyTorch tensors. CPU inference is viable for a 0.5B model: roughly 200–400 ms per classification, versus under 50 ms on a GPU. For batch processing, CPU is often sufficient. If you need latency below 100 ms, use a GPU instance.
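The latency figures above depend on hardware, input length, and thread count, so it is worth measuring on the target machine before choosing between CPU and GPU. The following is a minimal sketch of a timing harness, not part of the original app: `fake_classify` is a hypothetical stand-in for whatever function wraps the model call in app.py, and the 100 ms budget is the threshold mentioned above.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(
    classify: Callable[[str], object],
    texts: Sequence[str],
    warmup: int = 3,
) -> dict[str, float]:
    """Time classify() once per text and return p50/p95 latency in milliseconds."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Warm-up calls are excluded so one-time costs (lazy init, caches) do not skew results.
    for text in texts[:warmup]:
        classify(text)

    samples = []
    for text in texts:
        start = time.perf_counter()
        classify(text)
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


def fake_classify(text: str) -> str:
    """Hypothetical stand-in for the real model call; sleeps about 2 ms."""
    time.sleep(0.002)
    return "positive"


if __name__ == "__main__":
    stats = measure_latency_ms(fake_classify, ["great product"] * 20)
    budget_ms = 100.0  # latency requirement taken from the text above
    verdict = "CPU looks sufficient" if stats["p95"] < budget_ms else "consider a GPU instance"
    print(f"p50={stats['p50']:.1f} ms  p95={stats['p95']:.1f} ms  -> {verdict}")
```

To use it with the real model, pass your own classification function in place of `fake_classify` and a list of representative inputs. Comparing p95 against the budget, rather than the mean, gives a more conservative answer.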
Fly.io deployment
# fly.toml
app = 'sentiment-classifier'
primary_region = 'iad'
[build]
dockerfile = 'Dockerfile'
[http_service]
internal_port = 8000
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
[[vm]]
memory = "2gb" # 0.5B weights need ~1GB RAM at 16-bit precision (~2GB at float32)
cpu_kind = "shared"
cpus = 2
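The ~1GB figure in the memory comment follows from parameter count times bytes per parameter. A short sketch of that arithmetic is below; the parameter count is a nominal 500 million (an assumption, so check your model's actual size), and the estimate covers weights only, not activations or runtime overhead.

```python
# Nominal parameter count for a "0.5B" model; an assumption, check your model's actual size.
NUM_PARAMS = 500_000_000

# Storage size of one parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory for the weights alone, in GiB (excludes activations and runtime overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Under these assumptions, 16-bit weights take about 0.93 GiB and fit comfortably in a 2 GB VM, while float32 weights take about 1.86 GiB and leave almost no headroom. The dtype used when loading the model therefore matters when sizing the machine.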
Cold start for local models is 10–30 seconds
With min_machines_running = 0, the machine stops when idle, and the next request has to wait while it boots and loads the 0.5B model from disk, which takes 10–30 seconds. Set min_machines_running = 1 to keep an instance warm, or accept the cold start and show a loading indicator in your UI.
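If you accept the cold start, the client needs some way to know when to hide the loading indicator. One possible approach, sketched here under assumptions, is to poll a cheap readiness check until it succeeds or a timeout elapses. The `probe` callable is hypothetical: in a real client it would issue an HTTP request to the service (for example to a health endpoint, which app.py may or may not expose).

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses.

    Returns True if the service became ready in time, False otherwise.
    Exceptions from probe() (e.g. connection refused while the machine
    boots) are treated as "not ready yet".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # machine still starting; try again
        # Stop if there is no time left for another attempt.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated service that becomes ready on the third probe.
    attempts = {"n": 0}

    def fake_probe() -> bool:
        attempts["n"] += 1
        return attempts["n"] >= 3

    ready = wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.01)
    print("ready" if ready else "timed out", "after", attempts["n"], "probes")
```

The default 60-second timeout leaves margin over the 10–30 second load time quoted above. The `sleep` and `clock` parameters exist only so the function can be tested without real delays.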
Pushing the model to HuggingFace Hub
Share your fine-tuned model and demonstrate open-source contributions:
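from huggingface_hub import HfApi
# Requires a Hub token with write access (huggingface-cli login or the HF_TOKEN env var)
api = HfApi()
repo_id = "your-username/qwen2-0.5b-sentiment"
# Create the repo first; upload_folder fails if it does not exist yet
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./merged-model",
    repo_id=repo_id,
    repo_type="model",
)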
Then, in app.py, replace MODEL_PATH with the Hub repo ID. Transformers downloads and caches the model automatically on first load:
from transformers import AutoModelForCausalLM

MODEL_ID = "your-username/qwen2-0.5b-sentiment"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, ...)  # other arguments unchanged
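Because from_pretrained() accepts either a local directory or a Hub repo ID, the choice can be made at startup instead of being hard-coded. The sketch below is one way to do that, not the original app's logic: the `resolve_model_source` helper and the `MODEL_ID` environment variable name are hypothetical, and the default repo ID is the placeholder used above.

```python
import os
from pathlib import Path

DEFAULT_HUB_ID = "your-username/qwen2-0.5b-sentiment"  # placeholder repo ID from above


def resolve_model_source(local_dir: str = "./merged-model", env_var: str = "MODEL_ID") -> str:
    """Return the string to pass to from_pretrained(): a local directory or a Hub repo ID."""
    # An explicit environment variable wins (hypothetical name, not part of the original app).
    override = os.environ.get(env_var)
    if override:
        return override
    # A local model directory is only usable if it contains config.json.
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    # Otherwise fall back to downloading from the Hub.
    return DEFAULT_HUB_ID


if __name__ == "__main__":
    print("Loading model from:", resolve_model_source())
```

Preferring the local directory keeps the image self-contained and avoids a download during cold start. Falling back to the Hub makes the image smaller, at the cost of adding download time to the first load.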