Cosine Similarity
Cosine similarity is the most common metric for comparing embedding vectors. Understanding it from first principles lets you choose the right metric for the right problem and debug retrieval failures.
Learning objectives
- Compute cosine similarity from first principles with NumPy
- Explain why cosine similarity is preferred over Euclidean distance for embeddings
- Implement efficient batch similarity computation
- Know when to use dot product vs. cosine similarity vs. L2 distance
The formula
Cosine similarity measures the cosine of the angle θ between two vectors, ignoring their magnitude:
$$\text{cosine\_similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Where A · B is the dot product and ‖A‖ is the L2 norm (magnitude).
- Result range: -1 to +1
- 1.0 = identical direction (most similar)
- 0.0 = orthogonal (unrelated)
- -1.0 = opposite direction (most dissimilar)
For embeddings trained with cosine loss, scores typically fall between 0.0 and 1.0 in practice.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    # Cosine similarity is undefined for zero vectors; return 0.0 by convention
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot_product / (norm_a * norm_b))

# Toy example
v1 = np.array([1.0, 0.5, 0.2])
v2 = np.array([0.9, 0.6, 0.1])
v3 = np.array([-1.0, -0.5, -0.2])  # opposite of v1
print(f"v1 vs v2: {cosine_similarity(v1, v2):.4f}")  # 0.9888 (very similar)
print(f"v1 vs v3: {cosine_similarity(v1, v3):.4f}")  # -1.0000 (opposite)
print(f"v1 vs [0,1,0]: {cosine_similarity(v1, np.array([0, 1, 0])):.4f}")  # 0.4402 (partial similarity)
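To make the "angle" reading of the score concrete, here is a minimal sketch that converts a cosine similarity into degrees. The helper name `angle_degrees` is hypothetical and is not used elsewhere on this page; it assumes both inputs are non-zero vectors.

```python
import numpy as np

def angle_degrees(a: np.ndarray, b: np.ndarray) -> float:
    """Angle between a and b in degrees (hypothetical helper; assumes non-zero inputs)."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against tiny floating-point overshoot outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

x_axis = np.array([1.0, 0.0])
diagonal = np.array([1.0, 1.0])
y_axis = np.array([0.0, 1.0])

print(angle_degrees(x_axis, diagonal))  # about 45 degrees (cosine ≈ 0.707)
print(angle_degrees(x_axis, y_axis))    # about 90 degrees (cosine = 0)
print(angle_degrees(x_axis, -x_axis))   # about 180 degrees (cosine = -1)
```

Scaling either vector by a positive constant leaves the angle unchanged, which is the sense in which the metric ignores magnitude.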
Why cosine, not Euclidean distance?
Euclidean distance measures how far apart two points are in space. Cosine similarity measures the angle between them.
# Problem with Euclidean distance on embeddings
v_short = np.array([0.5, 0.5]) # short vector, 45° angle
v_long = np.array([5.0, 5.0]) # long vector, same 45° angle
# Euclidean: they look very different
print(f"Euclidean: {np.linalg.norm(v_short - v_long):.3f}") # 6.364
# Cosine: they're identical (same direction)
print(f"Cosine: {cosine_similarity(v_short, v_long):.3f}") # 1.000
Some embedding models produce vectors of varying magnitudes: a long document may produce a larger-magnitude vector than a short one even if they cover the same topic. Cosine similarity normalizes this out, so only direction affects the score.
When magnitude matters
Some models are trained with a dot-product objective, so the vector magnitude itself carries signal (for example, confidence). For these, dot product (without normalization) can be better than cosine similarity; check the model's documentation for the recommended metric. For most sentence transformers and for OpenAI embedding models, cosine similarity is the usual choice.
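A toy illustration of how the two metrics can disagree when magnitudes differ. The vectors are made up for the example and do not come from any real model.

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = np.array([
    [0.9, 0.1],   # nearly the same direction as the query, small magnitude
    [3.0, 3.0],   # 45 degrees away from the query, large magnitude
])

dot_scores = candidates @ query
cos_scores = dot_scores / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))

best_by_dot = int(np.argmax(dot_scores))  # favors the large-magnitude vector
best_by_cos = int(np.argmax(cos_scores))  # favors the better-aligned vector

print(f"dot product picks candidate {best_by_dot}, cosine picks candidate {best_by_cos}")
```

Which ranking is "right" depends on whether the model was trained so that magnitude means something.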
Dot product as a fast shortcut
If embeddings are pre-normalized (L2 norm = 1), dot product equals cosine similarity exactly, so no norm computation or division is needed at comparison time. At scale, that saving applies to every query-document pair.
def normalize(vectors: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1, norms)
# Normalized embeddings: dot product == cosine similarity
v1_norm = normalize(v1.reshape(1, -1))[0]
v2_norm = normalize(v2.reshape(1, -1))[0]
dot_sim = float(np.dot(v1_norm, v2_norm))
cos_sim = cosine_similarity(v1, v2)
print(f"Dot product (normalized): {dot_sim:.6f}")
print(f"Cosine similarity: {cos_sim:.6f}")
# Both produce the same result
Pre-normalize at index time
Normalize all your document embeddings when you insert them into the index. At query time, normalize the query vector and use dot product. This avoids recomputing every document norm on each query, leaving a single matrix-vector product.
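The index-time pattern described above might look like the following sketch. `NormalizedIndex` is a hypothetical in-memory class written for illustration, not the API of any vector database, and the random vectors stand in for real embeddings.

```python
import numpy as np

class NormalizedIndex:
    """Toy in-memory index (hypothetical, for illustration only)."""

    def __init__(self, doc_embeddings: np.ndarray) -> None:
        norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
        # Normalize once at insert time; all-zero rows stay zero
        self.vectors = doc_embeddings / np.where(norms == 0, 1.0, norms)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
        q_norm = np.linalg.norm(query)
        q = query / q_norm if q_norm > 0 else query
        scores = self.vectors @ q  # dot product equals cosine similarity here
        order = np.argsort(scores)[::-1][:k]
        return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
raw_docs = rng.normal(size=(100, 8))   # stand-in for real document embeddings
raw_query = rng.normal(size=8)         # stand-in for a real query embedding

index = NormalizedIndex(raw_docs)
hits = index.search(raw_query, k=3)
print(hits)
```

The document norms are computed once in the constructor, so each search costs one query normalization plus one matrix-vector product.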
Batch similarity computation
Computing similarity between one query and many documents efficiently:
def top_k_similar(
    query_embedding: np.ndarray,
    doc_embeddings: np.ndarray,
    k: int = 5
) -> list[tuple[int, float]]:
    """Return indices and scores of top-k most similar documents."""
    # Normalize both (the epsilon guards against zero vectors)
    q_norm = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    d_norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    d_normalized = doc_embeddings / (d_norms + 1e-8)
    # Batch dot product with a 1-D query: (D,) @ (D, N) → (N,)
    scores = np.dot(q_norm, d_normalized.T)
    # Get top-k indices (largest scores first)
    top_indices = np.argsort(scores)[::-1][:k]
    return [(int(idx), float(scores[idx])) for idx in top_indices]
# Example with real embeddings
from openai import OpenAI
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

docs = [
    "Python is a high-level programming language",
    "Django is a web framework for Python",
    "The Eiffel Tower is in Paris",
    "Machine learning requires large datasets",
    "Neural networks are inspired by the brain",
    "France is a country in Western Europe"
]
print("Embedding documents...")
doc_embeddings = np.array([get_embedding(doc) for doc in docs])
query = "deep learning and AI"
query_emb = get_embedding(query)
results = top_k_similar(query_emb, doc_embeddings, k=3)
print(f"\nTop 3 results for '{query}':")
for rank, (idx, score) in enumerate(results, 1):
    print(f"  {rank}. [{score:.3f}] {docs[idx]}")
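A full `np.argsort` sorts every score even though only k of them are needed. For large collections, one common alternative is `np.argpartition`, which selects the k largest entries without sorting the rest. The sketch below is an optional variation; `top_k_partition` is a hypothetical helper name and the scores are made up.

```python
import numpy as np

def top_k_partition(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, ordered from highest to lowest."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=int)
    # argpartition puts the k largest entries in the last k slots, in arbitrary order
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates, largest first
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.12, 0.87, 0.45, 0.91, 0.30, 0.66])
print(top_k_partition(scores, 3))  # [3 1 5]
```

For a handful of documents the difference is negligible; the partition approach only starts to matter when the number of scores is much larger than k.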
Similarity matrix for clustering
Computing all pairwise similarities between a set of documents:
def similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Compute N×N cosine similarity matrix."""
    normalized = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    return normalized @ normalized.T

# Find the most similar pair in a document set
def most_similar_pair(docs: list[str], embeddings: np.ndarray) -> tuple[str, str, float]:
    sim_matrix = similarity_matrix(embeddings)
    np.fill_diagonal(sim_matrix, -1)  # ignore self-similarity
    idx = np.unravel_index(np.argmax(sim_matrix), sim_matrix.shape)
    return docs[idx[0]], docs[idx[1]], float(sim_matrix[idx])

# Find the most dissimilar pair (useful for diverse sampling)
def most_diverse_pair(docs: list[str], embeddings: np.ndarray) -> tuple[str, str, float]:
    sim_matrix = similarity_matrix(embeddings)
    idx = np.unravel_index(np.argmin(sim_matrix), sim_matrix.shape)
    return docs[idx[0]], docs[idx[1]], float(sim_matrix[idx])
sim_mat = similarity_matrix(doc_embeddings)
d1, d2, score = most_similar_pair(docs, doc_embeddings)
print(f"\nMost similar pair (score={score:.3f}):\n '{d1}'\n '{d2}'")
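Beyond the single best pair, a similarity matrix can be scanned for every pair above a cutoff, which is a simple starting point for grouping near-duplicates. The following is a sketch under assumptions: `pairs_above` is a hypothetical helper, the 0.9 cutoff is arbitrary, and the 2-D unit vectors stand in for real embeddings.

```python
import numpy as np

def pairs_above(sim: np.ndarray, threshold: float) -> list[tuple[int, int, float]]:
    """List (i, j, score) for i < j where sim[i, j] >= threshold, highest score first."""
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # strictly above the diagonal
    keep = sim[rows, cols] >= threshold
    pairs = [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows[keep], cols[keep])]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy unit vectors standing in for real embeddings
vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
sim = vectors @ vectors.T

close_pairs = pairs_above(sim, 0.9)
print(close_pairs)  # only the pair (1, 3) clears the 0.9 cutoff
```

Using only the upper triangle skips the diagonal and avoids reporting each pair twice, since the matrix is symmetric.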
Distance metrics comparison
| Metric | Formula | Range | Use case |
|---|---|---|---|
| Cosine similarity | A·B / (‖A‖‖B‖) | [-1, 1] | Semantic similarity (standard) |
| Dot product | A·B | (-∞, +∞) | Normalized vectors, recommendation |
| Euclidean (L2) | ‖A - B‖ | [0, +∞) | Image features, when magnitude matters |
| Manhattan (L1) | Σ \|Aᵢ - Bᵢ\| | [0, +∞) | |
def compare_metrics(a: np.ndarray, b: np.ndarray) -> dict:
    return {
        "cosine_similarity": cosine_similarity(a, b),
        "dot_product": float(np.dot(a, b)),
        "euclidean_distance": float(np.linalg.norm(a - b)),
        "manhattan_distance": float(np.sum(np.abs(a - b)))
    }
# Real embedding comparison
emb1 = get_embedding("The cat sat on the mat")
emb2 = get_embedding("A feline rested on a rug")
emb3 = get_embedding("The stock market fell today")
print("Similar sentences (cat/feline):")
print(compare_metrics(np.array(emb1), np.array(emb2)))
print("\nDissimilar sentences (cat/stock):")
print(compare_metrics(np.array(emb1), np.array(emb3)))
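One relationship worth knowing when choosing between these metrics: for unit-length vectors, ‖a − b‖² = ‖a‖² + ‖b‖² − 2(a · b) = 2 − 2·cos(a, b), so the nearest neighbor by L2 distance is also the best match by cosine similarity. The sketch below checks this numerically on random vectors, which are stand-ins rather than real embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(50, 16))
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

query, docs = unit[0], unit[1:]
cos = docs @ query                          # cosine similarity (unit vectors)
l2 = np.linalg.norm(docs - query, axis=1)   # Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.allclose(l2 ** 2, 2 - 2 * cos))

# The highest-cosine document is also the smallest-distance document
same_best = int(np.argmax(cos)) == int(np.argmin(l2))

print(identity_holds, same_best)
```

The identity does not hold for unnormalized vectors, where L2 distance and cosine similarity can rank documents differently.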
Common mistakes
Comparing embeddings from different models
Embeddings from text-embedding-3-small and all-MiniLM-L6-v2 live in completely different vector spaces. Cosine similarity between them is meaningless. Always use the same model for query and document embeddings.
Don't use raw cosine similarity scores as absolute thresholds
A score of 0.8 from model A is not equivalent to 0.8 from model B. Calibrate thresholds on your own data (e.g., "for this model, scores above 0.75 indicate relevant results").
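One way to calibrate is to score a small labeled set of query-document pairs and pick the cutoff that maximizes a metric such as F1. This is a minimal sketch, not a prescribed procedure: `best_threshold` is a hypothetical helper, and the scores and labels are made up for illustration.

```python
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Return (threshold, f1) maximizing F1 when predicting relevant if score >= threshold."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):  # every observed score is a candidate cutoff
        predicted = scores >= t
        tp = int(np.sum(predicted & labels))
        fp = int(np.sum(predicted & ~labels))
        fn = int(np.sum(~predicted & labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Made-up similarity scores and relevance judgments, for illustration only
scores = np.array([0.91, 0.84, 0.79, 0.62, 0.58, 0.41, 0.33])
labels = np.array([True, True, True, False, True, False, False])

threshold, f1 = best_threshold(scores, labels)
print(f"threshold={threshold:.2f}, F1={f1:.3f}")  # threshold=0.58, F1=0.889
```

The resulting cutoff applies only to the model and data it was calibrated on; rerun the calibration after switching embedding models.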
Zero vectors cause division by zero
If an input produces a zero vector (rare but possible with empty strings or degenerate inputs), cosine similarity is undefined. Always guard against it, either with an explicit zero-norm check or by adding a small epsilon to the norm: norm + 1e-8.