Setup and Environment — Function-Calling Data Extractor¶
Project structure¶
data-extractor/
├── app.py ← FastAPI application
├── schemas.py ← Pydantic extraction schemas
├── extractor.py ← OpenAI tool_use extraction logic
├── eval.py ← F1 evaluation against ground truth
├── requirements.txt
├── .env.example
└── test_data/
├── invoices/
├── support_tickets/
└── ground_truth.json
Environment setup¶
# requirements.txt
openai==1.51.0
fastapi==0.115.0
uvicorn==0.30.6
pydantic==2.9.0
pymupdf==1.24.11
python-dotenv==1.0.1
httpx==0.27.2
Test data samples¶
Create sample unstructured texts to test extraction:
# test_data/samples.py
INVOICE_TEXT = """
INVOICE #INV-2024-0892
Date: November 15, 2024
Due Date: December 15, 2024
Bill To:
Acme Corporation
123 Business Ave, Suite 100
New York, NY 10001
Description Qty Unit Price Total
Software Development Services 40 $150.00 $6,000.00
Cloud Infrastructure Setup 1 $500.00 $500.00
Subtotal: $6,500.00
Tax: $520.00
Total: $7,020.00
Payment Terms: Net 30
"""
SUPPORT_TICKET_TEXT = """
From: customer@example.com
Subject: Cannot login to my account - URGENT
I've been trying to log into my account for the past 2 hours and keep getting
"Invalid credentials" even though I just reset my password. This is blocking
me from accessing critical project files before my 3pm deadline today.
Browser: Chrome 119
OS: macOS Ventura 13.5
Account: john.doe@example.com
"""