
Vector Database Overview

A vector database is a storage system optimized for embedding vectors. Beyond raw similarity search, production vector databases add persistence, metadata filtering, CRUD operations (update and delete), multi-tenancy, and monitoring.

Learning objectives

  • Compare the capabilities of ChromaDB, Pinecone, and Qdrant
  • Understand the indexing algorithms (HNSW, IVF) each database uses
  • Choose a database based on operational requirements

Why not just FAISS?

FAISS is a library — it's excellent for similarity search but provides no:

  • Persistence: You must save/load the index manually and keep metadata in a separate store
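  • Filtering: FAISS stores no metadata, so filtering on attributes happens outside the index, typically after retrieval
  • CRUD: Deletes and updates are limited; several index types (including HNSW) do not support removing vectors, so a delete often means rebuilding the index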
  • Multi-tenancy: No concept of users or namespaces
  • Monitoring: No built-in metrics or observability

Vector databases wrap an ANN library with these production features.
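
To make the gap between a search library and a database concrete, here is a minimal sketch of the extra bookkeeping involved. It is a toy, not how any of the databases below are implemented: `TinyVectorStore` and its methods are hypothetical names, search is brute force rather than ANN, and everything lives in memory.

```python
import numpy as np


class TinyVectorStore:
    """Toy in-memory store: brute-force search plus metadata, filtering, and deletes."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}   # id -> unit-length vector
        self.metadata = {}  # id -> dict of attributes

    def upsert(self, item_id, vector, metadata=None):
        v = np.asarray(vector, dtype=np.float32)
        if v.shape != (self.dim,):
            raise ValueError(f"expected a vector of shape ({self.dim},)")
        self.vectors[item_id] = v / np.linalg.norm(v)
        self.metadata[item_id] = dict(metadata or {})

    def delete(self, item_id):
        # Removing an entry is trivial here; in an ANN index it is the hard part.
        self.vectors.pop(item_id, None)
        self.metadata.pop(item_id, None)

    def query(self, vector, k=3, where=None):
        """Return up to k (id, cosine similarity) pairs whose metadata matches `where`."""
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        where = where or {}
        scored = [
            (item_id, float(v @ q))
            for item_id, v in self.vectors.items()
            if all(self.metadata[item_id].get(key) == val for key, val in where.items())
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore(dim=2)
store.upsert("a", [1.0, 0.0], {"tenant": "acme"})
store.upsert("b", [0.9, 0.1], {"tenant": "globex"})
store.upsert("c", [0.0, 1.0], {"tenant": "acme"})

hits = store.query([1.0, 0.0], k=2, where={"tenant": "acme"})
store.delete("a")
hits_after_delete = store.query([1.0, 0.0], k=2, where={"tenant": "acme"})
```

The filter is applied before scoring, so the `globex` vector never competes for a top-k slot; filtering after retrieval can return fewer than k results even when enough matching vectors exist.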


2025 production landscape

Database | Deployment | Best for | Pricing model
--- | --- | --- | ---
ChromaDB | Local / self-hosted | Development, small projects | Free (open source)
Pinecone | Fully managed cloud | Teams who want zero ops | Per vector stored + queries
Qdrant | Self-hosted / managed cloud | Production, filtering-heavy | Free self-hosted / usage-based cloud
Weaviate | Self-hosted / managed | GraphQL API, multi-modal | Free self-hosted / usage-based
Milvus | Self-hosted (Kubernetes) | Enterprise, very large scale | Free open source
pgvector | PostgreSQL extension | Existing Postgres users | Free

Feature matrix

Feature | ChromaDB | Pinecone | Qdrant
--- | --- | --- | ---
Persistent storage | ✅ (PersistentClient) | ✅ | ✅
Metadata filtering | ✅ (basic) | ✅ | ✅ (advanced)
Hybrid search | | ✅ (sparse-dense) | ✅ (sparse + dense)
Deletions | ✅ | ✅ | ✅
Namespaces/tenancy | ✅ (collections) | ✅ (namespaces) | ✅ (collections)
Self-hosted | ✅ | ❌ | ✅
Serverless | | ✅ | ✅ (cloud)
Python SDK | ✅ | ✅ | ✅
Index algorithm | HNSW | proprietary | HNSW

When to choose each

Starting a new project or prototyping?
    → ChromaDB (zero setup, great Python API)

Building for production with no ops team?
    → Pinecone Serverless (managed, scales automatically)

Need advanced filtering, self-hosted control, or open source?
    → Qdrant (best filtering language, Docker-friendly)

Already running PostgreSQL?
    → pgvector (add vector search to your existing DB)

Very large scale (100M+ vectors), enterprise?
    → Milvus or Weaviate

Indexing algorithms

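Most of the databases above, including ChromaDB and Qdrant, use HNSW (Hierarchical Navigable Small World) as their primary ANN index; Pinecone's index is proprietary. IVF (inverted file) indexes, which cluster the vectors and search only the closest clusters, are the main alternative and are offered by Milvus and pgvector alongside HNSW.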

HNSW key parameters:
- m: number of bidirectional links per node (16–64)
  Higher = better recall, more memory, slower build
- ef_construction: quality of graph during build (64–512)
  Higher = better quality, slower build
- ef (at query time): search beam width (32–256)
  Higher = better recall, slower query

Typical production settings:

  • m = 16: a good default for most use cases
  • ef_construction = 200: build once, build well
  • ef = 64: tune for your recall/latency target
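
To show what the query-time `ef` parameter controls, the sketch below runs a best-first beam search over a single-layer neighbour graph. It is a deliberate simplification, not HNSW itself: there is no layer hierarchy, the graph is built by brute force, and `build_graph` and `beam_search` are hypothetical helper names. A ring edge is added purely so that the toy graph is guaranteed to be connected.

```python
import heapq
import numpy as np


def build_graph(vectors, m):
    """Link each point to its m nearest neighbours, bidirectionally.

    A ring edge (i <-> i+1) is also added so the toy graph is always connected;
    real HNSW relies on its layered construction instead.
    """
    n = len(vectors)
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in np.argsort(dists[i])[1 : m + 1]:  # position 0 is the point itself
            neighbours[i].add(int(j))
            neighbours[int(j)].add(i)
        neighbours[i].add((i + 1) % n)
        neighbours[(i + 1) % n].add(i)
    return neighbours


def beam_search(vectors, neighbours, query, k, ef, entry=0):
    """Best-first search that keeps at most ef results (requires k <= ef)."""

    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap: closest unexplored node first
    results = [(-dist(entry), entry)]    # max-heap via negation: worst kept result first
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # the closest remaining candidate cannot improve the result set
        for nb in neighbours[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    best = sorted((-neg_d, i) for neg_d, i in results)[:k]
    return [i for _, i in best], len(visited)


rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 8))
query = rng.normal(size=8)
graph = build_graph(vectors, m=8)

narrow_ids, narrow_visited = beam_search(vectors, graph, query, k=5, ef=5)
wide_ids, wide_visited = beam_search(vectors, graph, query, k=5, ef=200)
```

With `ef` equal to the number of points the search degenerates into an exhaustive scan and returns the exact nearest neighbours; a small `ef` visits fewer nodes and may miss some of them. That is the recall/latency trade-off the settings above tune.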


Distance metrics

Most vector databases support multiple distance functions:

Metric | Formula | When to use
--- | --- | ---
Cosine | 1 - (A·B)/(‖A‖‖B‖) | Default for text embeddings
Dot product | -A·B | Pre-normalized embeddings (fastest)
Euclidean (L2) | ‖A-B‖₂ | Image features, when magnitude matters
Manhattan (L1) | ‖A-B‖₁ = Σᵢ abs(Aᵢ-Bᵢ) |
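
The four formulas in the table translate directly into NumPy. This is an illustrative sketch with hypothetical function names; real databases compute these inside their index code, and the dot-product "distance" is simply the negated similarity so that smaller always means closer.

```python
import numpy as np


def cosine_distance(a, b):
    # 1 - (A·B) / (‖A‖‖B‖)
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))


def dot_distance(a, b):
    # -A·B: negated so that a larger dot product gives a smaller distance
    return -float(a @ b)


def euclidean_distance(a, b):
    # ‖A-B‖₂
    return float(np.linalg.norm(a - b))


def manhattan_distance(a, b):
    # ‖A-B‖₁: sum of absolute coordinate differences
    return float(np.sum(np.abs(a - b)))


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(cosine_distance(a, b))     # approximately 0.04
print(dot_distance(a, b))        # -24.0
print(euclidean_distance(a, b))  # approximately 1.4142
print(manhattan_distance(a, b))  # 2.0
```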

Use cosine for text, pre-normalize for speed

Normalizing embeddings to unit length and then using dot product gives exactly the cosine similarity, but is faster at query time because the norms no longer have to be computed and divided out. Normalize once at ingest time, and normalize each query vector the same way.
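
A quick numerical check of this equivalence, as a sketch on random data: `normalize_rows` is a hypothetical helper, and it assumes no embedding is the zero vector (which would cause a division by zero).

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit L2 norm (assumes no all-zero rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

# Ingest time: normalize once and store the unit vectors.
docs_unit = normalize_rows(docs)

# Query time: normalize the query, then a plain dot product is enough.
query_unit = query / np.linalg.norm(query)
dot_scores = docs_unit @ query_unit

# Reference: cosine similarity computed on the raw vectors.
cosine_scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print(np.allclose(dot_scores, cosine_scores))
```

The per-query cost drops to a single matrix-vector product; the division by the document norms has been paid once, up front.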

