Datasets and Model Cards
The datasets library is Hugging Face's pandas-like toolkit for ML datasets: it handles sharded files, streaming without a full download, and fast Arrow-backed processing. Model cards document what a model does, how it was trained, and where it fails.
Learning objectives
- Load and explore Hub datasets with datasets
- Stream large datasets without loading them fully into memory
- Process datasets with .map() for tokenization and feature extraction
- Read and write model cards using the ModelCard API
Loading datasets
from datasets import load_dataset
# Load a small benchmark dataset (downloads and caches automatically)
dataset = load_dataset("openai/gsm8k", "main")
print(dataset)
# DatasetDict({
# train: Dataset({features: ['question', 'answer'], num_rows: 7473})
# test: Dataset({features: ['question', 'answer'], num_rows: 1319})
# })
# Access splits
train = dataset["train"]
test = dataset["test"]
# Inspect examples
print(train[0])
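# {'question': 'Natalia sold clips...', 'answer': '... #### 72'}
# Each answer holds the worked solution and ends with "#### <final number>"
# Filter, slice, and select
# Keep problems with longer worked solutions (more than 4 lines, a rough difficulty proxy)
hard_examples = train.filter(lambda x: len(x["answer"].splitlines()) > 4)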
first_100 = train.select(range(100))
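Each GSM8K `answer` holds the worked solution followed by a final line of the form `#### <number>`. Below is a minimal sketch of extracting that final number, for example for exact-match scoring. It assumes every row follows this format; `extract_final_answer` is a hypothetical helper, not part of `datasets`, and the sample string is abridged for illustration.

```python
import re
from typing import Optional

# Matches the final-answer marker, e.g. "#### 72" or "#### 1,250".
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")


def extract_final_answer(answer: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed, or None."""
    match = _FINAL_ANSWER.search(answer)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Abridged illustration of the answer format (assumption: real rows follow it).
sample = "Natalia sold 48/2 = 24 clips in May.\n48 + 24 = 72 clips altogether.\n#### 72"
print(extract_final_answer(sample))
```

Returning a string rather than a number leaves the choice of int or float comparison to the caller.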
Exploring dataset schema and statistics
from datasets import load_dataset
import pandas as pd
dataset = load_dataset("squad", split="validation[:500]")
# Schema
print(dataset.features)
# {'id': Value(dtype='string'), 'title': Value(dtype='string'),
# 'context': Value(dtype='string'), 'question': Value(dtype='string'),
# 'answers': Sequence({'text': [...], 'answer_start': [...]})}
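# Convert to pandas for quick exploration
df = dataset.to_pandas()
# The nested 'answers' column holds dicts, which describe() cannot hash, so drop it first
print(df.drop(columns=["answers"]).describe(include="all"))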
print(f"\nAvg context length: {df['context'].str.len().mean():.0f} chars")
print(f"Avg question length: {df['question'].str.len().mean():.0f} chars")
print(f"Unique titles: {df['title'].nunique()}")
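The nested `answers` feature pairs each answer string with its character offset into `context`. Below is a small sketch of checking that alignment on one record. The record is made up for illustration and only mirrors the schema printed above; `answer_spans` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def answer_spans(record: Dict) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each answer, verifying it matches the context."""
    spans = []
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        if record["context"][start:end] != text:
            raise ValueError(f"Answer {text!r} is not at offset {start}")
        spans.append((start, end, text))
    return spans


# Made-up record that mirrors the SQuAD schema (not taken from the dataset).
record = {
    "context": "The library stores data in Apache Arrow tables.",
    "question": "What format does the library store data in?",
    "answers": {"text": ["Apache Arrow"], "answer_start": [27]},
}
print(answer_spans(record))
```

A check like this is useful after any preprocessing that edits `context`, since trimming or normalizing the text silently shifts the offsets.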
Streaming large datasets
Datasets such as The Pile, RedPajama, or Common Crawl range from hundreds of gigabytes to many terabytes. Stream them instead of downloading them in full.
from datasets import load_dataset
# streaming=True: records are fetched lazily as you iterate, so the full
# dataset is never downloaded or held in memory
streamed = load_dataset(
    "allenai/c4",
    "en",
    split="train",
    streaming=True,
)
# Iterate over batches
for i, batch in enumerate(streamed.iter(batch_size=32)):
    texts = batch["text"]
    print(f"Batch {i}: {len(texts)} examples, first: {texts[0][:80]}...")
    if i >= 2:
        break
# Take a fixed number of examples for quick experiments
sample = list(streamed.take(1000))
print(f"Sampled {len(sample)} examples")
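The memory behaviour of streaming can be illustrated without the library. The sketch below uses plain generators. `read_records`, `iter_batches`, and the in-memory JSON Lines source are hypothetical stand-ins, not `datasets` APIs; they only mimic the lazy iteration that `iter(batch_size=...)` and `take()` expose.

```python
import io
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def read_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Lazily parse one JSON record per line; nothing is read until requested."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def iter_batches(records: Iterable[Dict], batch_size: int) -> Iterator[Dict[str, List]]:
    """Group records into column-oriented batches: {"text": [...], ...}."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


# Stand-in for a remote shard: 10 small records in JSON Lines form.
source = io.StringIO("\n".join(json.dumps({"text": f"doc {i}"}) for i in range(10)))

batches = iter_batches(read_records(source), batch_size=4)
first = next(batches)  # only the first 4 lines have been parsed at this point
print(len(first["text"]))
remaining = list(batches)
print([len(b["text"]) for b in remaining])
```

Because each stage pulls from the previous one on demand, peak memory is bounded by one batch rather than by the size of the source.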
Processing datasets with .map()
.map() applies a function to every example (or batch) and caches the result.
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = load_dataset("imdb", split="train[:1000]")
def tokenize(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256,
        return_tensors=None,  # Return lists, not tensors (for dataset caching)
    )
# Tokenize in batches, which is much faster than row by row
tokenized = dataset.map(
    tokenize,
    batched=True,
    batch_size=64,
    num_proc=2,  # Parallel processing
    remove_columns=["text"],  # Drop original text column
)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
print(tokenized)
# Dataset with input_ids, attention_mask, label — ready for PyTorch DataLoader
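With `batched=True`, the mapped function receives a dict of columns (each a list) rather than a single row, and returns a dict of new columns. Below is a library-free sketch of that contract. `apply_batched` and `add_length` are hypothetical names used only for illustration, and the sketch assumes the returned columns have one value per input row. The real `.map()` additionally handles caching, Arrow storage, and multiprocessing.

```python
from typing import Callable, Dict, List

Columns = Dict[str, List]


def apply_batched(columns: Columns, fn: Callable[[Columns], Columns], batch_size: int) -> Columns:
    """Call fn on successive column slices and merge its outputs into the result."""
    num_rows = len(next(iter(columns.values())))
    result: Columns = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        new_columns = fn(batch)
        # Existing columns are kept; columns returned by fn are added or overwritten.
        for name, values in {**batch, **new_columns}.items():
            result.setdefault(name, []).extend(values)
    return result


def add_length(batch: Columns) -> Columns:
    """Batched function: one output value per input row."""
    return {"n_chars": [len(text) for text in batch["text"]]}


data = {"text": ["good film", "bad", "fine"], "label": [1, 0, 1]}
out = apply_batched(data, add_length, batch_size=2)
print(out["n_chars"])
```

Writing the function against whole columns is what lets a tokenizer process many texts in one call, which is where most of the speedup from batching comes from.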
Pushing a dataset to the Hub
import os
from datasets import Dataset, DatasetDict
# Create a dataset from Python objects
examples = [
    {"question": "What is RAG?", "answer": "Retrieval-Augmented Generation", "category": "rag"},
    {"question": "What is HNSW?", "answer": "Hierarchical Navigable Small World", "category": "vector_db"},
    {"question": "What is LoRA?", "answer": "Low-Rank Adaptation", "category": "fine_tuning"},
]
train_ds = Dataset.from_list(examples)
eval_ds = Dataset.from_list(examples[:1])
full_dataset = DatasetDict({"train": train_ds, "validation": eval_ds})
# Push to Hub (creates or updates the repository)
full_dataset.push_to_hub(
    "your-username/llm-course-qa",
    token=os.getenv("HF_TOKEN"),  # Read the access token from the environment
    private=True,
)
Reading and writing model cards
Model cards document intended use, training data, evaluation results, and limitations. The huggingface_hub SDK includes a ModelCard class.
from huggingface_hub import ModelCard, ModelCardData
# Read a model card
card = ModelCard.load("mistralai/Mistral-7B-Instruct-v0.3")
print(card.data) # Parsed YAML frontmatter
print(card.text[:500]) # Markdown body, without the YAML frontmatter
# Create a new model card
card_data = ModelCardData(
    language=["en"],
    license="apache-2.0",
    library_name="transformers",
    tags=["text-generation", "llm", "fine-tuned"],
    datasets=["squad"],
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    pipeline_tag="text-generation",
    # metrics takes metric names only; the scores are reported in the card body below
    metrics=["faithfulness", "answer_relevancy"],
)
card_content = f"""---
{card_data.to_yaml()}
---
# My Fine-Tuned Q&A Model
Fine-tuned from Llama 3.1 8B Instruct on a domain-specific Q&A dataset.
## Intended use
Customer support Q&A for software products. Do not use for medical, legal, or financial advice.
## Evaluation results
| Metric | Score |
|--------|-------|
| Faithfulness (RAGAS) | 0.87 |
| Answer Relevancy | 0.91 |
| Context Recall | 0.83 |
## Limitations
- English only
- Domain-specific: may perform poorly on general knowledge questions
- Context window: 4096 tokens
## Training details
- Base model: Llama 3.1 8B Instruct
- Method: QLoRA (r=16, alpha=32, dropout=0.05)
- Training data: 5,000 domain-specific QA pairs
- Hardware: 1x A100 40GB, 2 hours
"""
new_card = ModelCard(card_content)
# new_card.push_to_hub("your-username/my-model")
print("Card created successfully")
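On the Hub, a model card is the repository's `README.md`: a YAML metadata block between `---` lines, followed by markdown. Below is a rough sketch of that layout, assuming well-formed input. `split_front_matter` is a hypothetical helper written for illustration; `ModelCard` does this parsing, plus YAML decoding, for you.

```python
from typing import Tuple


def split_front_matter(content: str) -> Tuple[str, str]:
    """Split a card into (yaml_header, markdown_body); header is '' if absent."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", content
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return header, body
    return "", content  # no closing delimiter: treat everything as body


card = """---
license: apache-2.0
language:
- en
---

# My Fine-Tuned Q&A Model

Fine-tuned on a domain-specific Q&A dataset.
"""
header, body = split_front_matter(card)
print(header)
print(body.splitlines()[0])
```

The header roughly corresponds to what `card.data` exposes after parsing, and the body to `card.text`.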
Model cards are infrastructure
Treat your model card as code. It's version-controlled, it communicates expected behavior to users, and it provides the paper trail needed for compliance audits. A model without a card is a black box — even to yourself 6 months later.