Vector Stores & Databases¶

1. Why does this topic exist?¶

After embedding millions of chunks, we have millions of 1500-dim vectors. The retrieval problem: given a query vector, find the k most similar vectors in the corpus, in <100 ms.

Why SQL/NoSQL databases can't do this¶

-- Imagine you stored vectors as columns in Postgres
SELECT * FROM chunks
ORDER BY cosine_similarity(embedding, $1) DESC
LIMIT 5;

To execute this, Postgres would:

1. Compute cosine similarity for every row (an O(N) scan).
2. Sort by score.
3. Return the top 5.

For 10M chunks this can take on the order of a minute (the exact figure depends on hardware and vector dimension). Unusable for real-time search.

The problem: databases are built for exact lookups on indexed columns, not "find me the row with the most similar 1500-dim float vector."
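To make the cost of that scan concrete, here is a minimal brute-force sketch in plain Python. The helper names (`cosine`, `brute_force_top_k`) and the random 32-dim corpus are illustrative assumptions, not part of any database; the point is only that every query touches every stored vector.

```python
import heapq
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k=5):
    """Score every stored vector against the query: O(N * d) work per query."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)  # [(score, row_index), ...], best first


random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(32)] for _ in range(2_000)]
query = corpus[42]  # a stored vector should be its own nearest neighbor
top = brute_force_top_k(query, corpus, k=5)
print(top[0][1])  # prints 42
```

Doubling the corpus doubles the work per query, which is exactly the scaling an ANN index is meant to avoid.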

What we actually need¶

Fast nearest-neighbor search (sub-100ms for millions of vectors).
Disk + RAM storage that scales.
Metadata filtering (only docs from this user, this date range, etc.).
Persistence (survive restarts).
Updates (add/remove/upsert without re-indexing everything).

These requirements drove a new class of database: the vector database.

Industry pain example: A 2021 startup tried to do RAG with Postgres + cosine UDFs. Queries took 30s on 1M chunks. They were forced to migrate to FAISS, then Pinecone. Total rewrite: 6 weeks.

2. What is it?¶

Simple explanation¶

A vector database is a search engine for similarity. You give it vectors; it returns the most similar ones, fast.

Technical explanation¶

A vector database is a system that:

Stores vectors + associated metadata + original text.
Indexes them with an approximate nearest neighbor (ANN) algorithm (HNSW, IVF, PQ, ScaNN…).
Searches by similarity score (cosine, dot, Euclidean) returning top-k results.
Filters by metadata (WHERE user_id = X AND year > 2023).
Persists, replicates, scales like any other database.

Industry definition¶

The term "vector database" became standard around 2022 with the rise of LLM applications. The underlying ANN algorithms predate this (the HNSW paper appeared as a preprint in 2016; IVF combined with product quantization dates to 2011), but vector DBs package them with the database concerns enterprises need (auth, scaling, monitoring).

Mental model¶

Think of a vector store as a library with magnetic shelves:

Books are indexed by topic in N-dimensional space (instead of alphabetical).
Similar books cluster on nearby shelves.
The librarian (ANN index) knows shortcuts — never walks every shelf.
Metadata tags on each book let you filter ("only books from 2023, only physics section").

Analogy¶

Spotify's "Songs You Might Like." Spotify embeds every song into a vector capturing genre, tempo, instrumentation, mood. Your query song's vector is compared against millions of others. The closest ones are suggested. Same math, same architecture as RAG retrieval.

3. How does it work?¶

The four moving parts¶

flowchart TD
    VS[Vector Store] --> V[Vectors - float arrays]
    VS --> D[Documents - original text]
    VS --> M[Metadata - filters, citations]
    VS --> I[Index - ANN data structure]

How ANN makes search fast: HNSW intuition¶

Hierarchical Navigable Small World (HNSW) is the most widely used ANN index in vector databases.

flowchart TD
    L2[Top layer: sparse, long edges]
    L1[Middle layer: medium density]
    L0[Bottom layer: dense, every vector]
    L2 --> L1
    L1 --> L0

Search:

1. Start at the top (sparse) layer and jump to the vector closest to the query.
2. Descend to the next layer and refine.
3. Repeat until the bottom layer.
4. Search the local neighborhood at the bottom for the approximate top-k.

Total time: roughly O(log n) instead of O(n) (an empirical scaling, not a worst-case guarantee). 100M vectors → ~15-20 hops → <50ms when the index fits in RAM.
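The core move inside each layer is a greedy walk: hop to whichever neighbor is closer to the query until no neighbor improves. The sketch below shows only that walk on a single layer. It is a deliberate simplification and not HNSW itself: the graph is built by brute force, there is no layer hierarchy, and there is no candidate list (`ef_search`). All names (`build_knn_graph`, `greedy_search`) are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(points, m=8):
    """Link each point to its m nearest neighbors (brute force, toy sizes only)."""
    graph = {}
    for i, p in enumerate(points):
        order = sorted(range(len(points)), key=lambda j: dist(p, points[j]))
        graph[i] = order[1 : m + 1]  # skip order[0], which is the point itself
    return graph


def greedy_search(points, graph, query, entry=0):
    """Walk to whichever neighbor is closer to the query until none is."""
    current, hops = entry, 0
    while True:
        best = min(graph[current], key=lambda j: dist(points[j], query))
        if dist(points[best], query) >= dist(points[current], query):
            return current, hops  # local minimum: no neighbor improves
        current, hops = best, hops + 1


random.seed(1)
points = [(random.random(), random.random()) for _ in range(500)]
graph = build_knn_graph(points)
query = (0.9, 0.1)
found, hops = greedy_search(points, graph, query)
print(found, hops, round(dist(points[found], query), 3))
```

A pure greedy walk can stop at a local minimum that is not the true nearest neighbor. Real HNSW reduces that risk by keeping a list of candidates at query time and by using the sparse upper layers to start the walk close to the target.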

Other ANN algorithms¶

Algorithm	Use case	Trade-off
HNSW	General default	Best quality, more RAM
IVF (Inverted File)	Massive scale	Coarser, faster, less RAM
PQ (Product Quantization)	Tight memory	Lossy compression of vectors
IVF + PQ	Web-scale	Combines coarse partitioning + compression
ScaNN (Google)	Google services	Partitioning + anisotropic quantization; smaller ecosystem outside Google
DiskANN (Microsoft)	Disk-based	Massive scale on SSD

Most vector DBs default to HNSW and offer IVF-PQ for memory-constrained, massive-scale deployments.
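The memory trade-offs in the table come down to simple arithmetic. The sketch below estimates raw vector storage only (no graph links, no metadata) for an assumed corpus of 10M vectors at 1536 dimensions; the PQ setting of 96 subvectors with 8-bit codes is an illustrative choice, not a recommendation.

```python
def index_bytes(n_vectors, dim, bits_per_dim=32):
    """Raw vector storage in bytes, ignoring graph links and metadata."""
    return n_vectors * dim * bits_per_dim // 8


def pq_bytes(n_vectors, n_subvectors, bits_per_code=8):
    """PQ stores one small code per subvector instead of the original floats."""
    return n_vectors * n_subvectors * bits_per_code // 8


N, DIM = 10_000_000, 1536
GB = 1e9
print(f"float32: {index_bytes(N, DIM) / GB:.1f} GB")      # float32: 61.4 GB
print(f"int8:    {index_bytes(N, DIM, 8) / GB:.1f} GB")   # int8:    15.4 GB
print(f"binary:  {index_bytes(N, DIM, 1) / GB:.2f} GB")   # binary:  1.92 GB
print(f"PQ m=96: {pq_bytes(N, 96) / GB:.2f} GB")          # PQ m=96: 0.96 GB
```

The compressed formats are lossy, so the saving is paid for in recall unless results are re-scored against the full-precision vectors.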

Search modes¶

1. Pure similarity search¶

results = vector_store.similarity_search("What is RAG?", k=4)

Returns the 4 closest chunks. Simple and fast, but may return near-duplicates.

2. MMR (Maximum Marginal Relevance)¶

results = vector_store.max_marginal_relevance_search(
    "What is RAG?",
    k=4, fetch_k=20, lambda_mult=0.6,
)

Picks 4 from a pool of 20, balancing relevance with diversity. lambda_mult=1 means pure relevance; lambda_mult=0 means maximum diversity.
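What MMR does with that pool can be sketched in a few lines. This is a minimal stand-alone version of the greedy selection rule under stated assumptions, not LangChain's implementation: each step picks the candidate maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The toy 2-D vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query, candidates, k=4, lambda_mult=0.6):
    """Return indices into `candidates`, chosen greedily by MMR score."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = (1.0, 0.3)
candidates = [(1.0, 0.0), (0.99, 0.05), (0.6, 0.8)]  # items 0 and 1 are near-duplicates

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [1, 0]: relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.6))  # [1, 2]: duplicate skipped
```

With `lambda_mult=1.0` the two near-duplicates fill both slots; at `0.6` the second slot goes to the less similar but non-redundant item.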

3. Similarity with score threshold¶

results = vector_store.similarity_search_with_relevance_scores(
    "query", k=10, score_threshold=0.7,
)

Returns at most k, but only those above threshold. Use when "no answer" is better than "bad answer."
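The "no answer is better than a bad answer" behaviour can be sketched without any vector store. The function below assumes hits arrive as `(document, score)` pairs with higher scores meaning more relevant, which is the shape the call above returns; the function name and sample texts are hypothetical.

```python
def keep_confident(scored_hits, score_threshold=0.7, k=10):
    """Keep at most k (doc, score) pairs whose score clears the threshold."""
    kept = [(doc, score) for doc, score in scored_hits if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


hits = [
    ("Refunds are issued within 14 days.", 0.82),
    ("Office hours are 9 to 5.", 0.41),
]
context = keep_confident(hits)
if context:
    print(context[0][0])  # Refunds are issued within 14 days.
else:
    print("I could not find a reliable answer.")
```

An empty result is a signal to the application, which can then decline to answer instead of prompting the LLM with weak context.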

Metadata filtering¶

results = vector_store.similarity_search(
    "refund policy",
    k=4,
    filter={"customer_id": "acme-corp", "doc_type": "policy"},
)

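Pre-filtering (Pinecone, Weaviate, Qdrant): filter THEN search. The filter runs against indexed metadata, so the ANN search only considers matching vectors. Fast.

Post-filtering (FAISS in-memory): search THEN filter. With a selective filter most of the top-k is discarded, so you must over-fetch to end up with k results, which is slow on big indexes.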
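The difference is easiest to see on a toy result list. The sketch below uses hand-written scores in place of a real ANN search (an assumption for illustration) and shows why post-filtering can return fewer than k results when the filter is selective.

```python
def post_filter(hits, predicate, k):
    """Take the top-k by score first, then drop hits that fail the filter."""
    top_k = sorted(hits, key=lambda h: h["score"], reverse=True)[:k]
    return [h for h in top_k if predicate(h)]


def pre_filter(hits, predicate, k):
    """Drop non-matching hits first, then take the top-k of what is left."""
    allowed = [h for h in hits if predicate(h)]
    return sorted(allowed, key=lambda h: h["score"], reverse=True)[:k]


hits = [
    {"id": 1, "score": 0.95, "customer_id": "globex"},
    {"id": 2, "score": 0.93, "customer_id": "globex"},
    {"id": 3, "score": 0.90, "customer_id": "acme-corp"},
    {"id": 4, "score": 0.70, "customer_id": "acme-corp"},
    {"id": 5, "score": 0.65, "customer_id": "acme-corp"},
]


def is_acme(hit):
    return hit["customer_id"] == "acme-corp"


print([h["id"] for h in post_filter(hits, is_acme, k=3)])  # [3]
print([h["id"] for h in pre_filter(hits, is_acme, k=3)])   # [3, 4, 5]
```

Post-filtering asked for 3 results and delivered 1. The usual workaround is to fetch far more than k and filter afterwards, which is where the extra latency comes from.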

The 7 vector databases — deep comparison¶

1. FAISS (Facebook AI Similarity Search)¶

Architecture: Library, not a database. In-memory by default; can persist to disk via save_local (a method of the LangChain wrapper; FAISS itself exposes write_index / read_index). Runs inside a single process.

flowchart LR
    A[Your Python process] --> B[FAISS in-memory index]
    B --> C[Optional: save/load disk]

from langchain_community.vectorstores import FAISS

vs = FAISS.from_documents(chunks, embedder)
vs.save_local("./faiss-index")

# Load later
vs = FAISS.load_local("./faiss-index", embedder, allow_dangerous_deserialization=True)

Pros:

Among the fastest in-memory ANN implementations.
Free, open-source.
Supports many ANN algorithms (HNSW, IVF, PQ).
No infrastructure needed.

Cons:

In-memory by default, so RAM limits dataset size.
No native metadata filtering (the LangChain wrapper post-filters).
Runs inside one process: no server, and no built-in way to share an index across processes.
Persistence is manual: you save the index to a file and reload it yourself.
No production features (auth, replication, monitoring).

Cost: Free. Hardware = RAM you provision.

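Scalability: Up to ~10M vectors on a 64GB RAM machine, depending on dimension and compression (10M uncompressed 1536-dim float32 vectors alone take about 61 GB, so at that dimension you need quantization or more RAM). Beyond that → Pinecone/Weaviate.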

Use cases: Local dev, prototypes, single-process apps, on-premise without infrastructure team.

2. ChromaDB¶

Architecture: Embedded database (think SQLite of vector DBs). Persists to local disk by default. Has a server mode (chroma run) for multi-process access.

flowchart LR
    A[Your app] --> B[Chroma client]
    B --> C[Local persistent DB]
    B --> D[Or remote Chroma server]

from langchain_community.vectorstores import Chroma

vs = Chroma.from_documents(
    chunks, embedder,
    persist_directory="./chroma-db",
)
# Auto-persists.

# Reload
vs = Chroma(persist_directory="./chroma-db", embedding_function=embedder)

Pros:

Persistent by default.
Simple API, great for development.
Supports metadata filtering.
Multi-tenancy via collections.
Open-source.

Cons:

Newer (2023+), maturing.
Performance below FAISS for raw speed.
Limited at very large scale (>10M vectors).
Server mode is still evolving.

Cost: Free (self-hosted). Chroma Cloud: $0.10/hr starter tier.

Scalability: Up to ~10M vectors local; >10M needs sharding.

Use cases: Local dev with persistence, small-to-medium prod apps, multi-collection multi-tenant.

3. Pinecone¶

Architecture: Fully managed cloud service. Closed-source. Multi-tenant, distributed by default.

flowchart LR
    A[Your app] --> B[Pinecone API]
    B --> C[Pinecone distributed cluster]
    C --> D[Sharded indexes across regions]

from pinecone import Pinecone
from langchain_pinecone import PineconeVectorStore  # lives in the langchain-pinecone package

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

vs = PineconeVectorStore(index=index, embedding=embedder)
vs.add_documents(chunks)

Pros:

Zero infrastructure to manage.
Excellent metadata filtering performance.
Auto-scaling, replication, HA.
Hybrid search built-in (dense + sparse).
SLAs for production.

Cons:

Costs money beyond the free Starter tier (pod-based indexes historically started around $0.096/hr).
Vendor lock-in: your data lives on Pinecone's servers.
Network latency from your app to the Pinecone region.
Less control over index internals.

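Cost: Free Starter tier with usage limits; pod-based from roughly $0.096/hr; serverless pay-per-use; enterprise $1K-$10K+/month. Prices change often, so check the current pricing page.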

Scalability: Multi-billion vectors. Used in production at scale.

Use cases: Production apps where you don't want to operate a vector DB. Fortune 500 enterprises. Anyone wanting "Postgres for vectors" but cloud-only.

4. Weaviate¶

Architecture: Open-source vector DB with GraphQL API. Self-hostable OR cloud (Weaviate Cloud).

flowchart LR
    A[Your app] --> B[Weaviate GraphQL API]
    B --> C[Weaviate cluster]
    C --> D[HNSW index per class]
    C --> E[Inverted index for keywords]

import weaviate
from langchain_weaviate import WeaviateVectorStore  # integration for the v4 client

client = weaviate.connect_to_local()  # v4 client API
vs = WeaviateVectorStore(client=client, index_name="MyClass", text_key="page_content", embedding=embedder)

Pros:

Hybrid search (dense + BM25) built-in.
Strong filtering via GraphQL.
Schema-driven (typed classes, validation).
Modular: plug in your own embedder, reranker, generator.
Open-source.

Cons:

GraphQL adds a learning curve.
More ops than Pinecone if self-hosted.
Schema design upfront, so less flexible than schemaless options.

Cost: Self-hosted: free. Weaviate Cloud: $0.09-$0.30 per million vectors per hour depending on tier.

Scalability: Distributed; production deployments up to billions of vectors.

Use cases: Schema-driven RAG, hybrid search out-of-the-box, enterprises that want self-host option.

5. Milvus¶

Architecture: Distributed vector DB. Cloud-native (Kubernetes), built for billion-scale.

flowchart LR
    A[Your app] --> B[Milvus Proxy]
    B --> C[Coordinator]
    B --> D[Query Node]
    B --> E[Index Node]
    B --> F[Data Node]

Microservice architecture — separates query/index/data layers. Designed for elastic scaling.

from langchain_community.vectorstores import Milvus

vs = Milvus.from_documents(
    chunks, embedder,
    connection_args={"host": "localhost", "port": "19530"},
    collection_name="my_collection",
)

Pros:

Built for scale: billions of vectors.
Many index types (HNSW, IVF, IVF-PQ, ANNOY, DiskANN).
GPU acceleration for indexing.
Open-source + managed (Zilliz Cloud).

Cons:

Operational complexity: Kubernetes setup is non-trivial.
Overkill for small-to-medium workloads.
Steeper learning curve.

Cost: Self-hosted free. Zilliz Cloud: $99/month starter tier.

Scalability: Billions of vectors, distributed across nodes.

Use cases: Web-scale search, e-commerce recommendations, very large RAG corpora.

6. Qdrant¶

Architecture: Rust-based vector DB. High performance per-node. Distributed mode available.

flowchart LR
    A[Your app] --> B[Qdrant HTTP/gRPC API]
    B --> C[Qdrant node]
    C --> D[HNSW index]
    C --> E[Payload index for filters]

from langchain_qdrant import Qdrant

vs = Qdrant.from_documents(
    chunks, embedder,
    url="http://localhost:6333",
    collection_name="my_collection",
)

Pros:

Written in Rust → very fast and memory-efficient.
Excellent filtering via "payload index", fast even on huge corpora.
Quantization support: binary, scalar, product quantization.
Open-source + Qdrant Cloud.
Good docs, active development.

Cons:

Smaller ecosystem than Pinecone/Weaviate.
Fewer integrations vs Weaviate's GraphQL ecosystem.

Cost: Self-hosted free. Qdrant Cloud: $0.0142/hr for 1GB cluster.

Scalability: Single-node handles 100M+; distributed for billions.

Use cases: High-throughput production RAG, filter-heavy workloads, teams that value performance + control.

7. pgvector (Postgres extension)¶

Architecture: A Postgres extension that adds a vector column type and ANN index.

flowchart LR
    A[Your app] --> B[Postgres]
    B --> C[Tables: chunks, vectors, metadata]
    C --> D[pgvector HNSW or IVFFlat index]

CREATE EXTENSION vector;
CREATE TABLE chunks (
  id bigserial PRIMARY KEY,
  page_content text,
  metadata jsonb,
  embedding vector(1536)
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

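from langchain_postgres import PGVector

# langchain_postgres creates and manages its own tables;
# it does not read the hand-written `chunks` table above.
vs = PGVector(
    embeddings=embedder,
    collection_name="my_collection",
    connection="postgresql+psycopg://user:pwd@host/db",
)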

Pros:

Use your existing Postgres: no new infrastructure.
Transactional consistency with the rest of your data.
Rich SQL filters + JOINs.
Mature, battle-tested DB.
Free if you already run Postgres.

Cons:

HNSW in pgvector is slower than dedicated vector DBs at scale.
Scaling vectors is harder than scaling rows.
RAM-hungry for large indexes.

Cost: Free if existing Postgres. Cloud: standard Postgres pricing.

Scalability: Comfortable up to ~10-50M vectors. Beyond → dedicated vector DB.

Use cases: Apps already on Postgres, where vectors are a small piece of the data model. Avoid for vector-heavy workloads or massive scale.

The decision matrix¶

	FAISS	Chroma	Pinecone	Weaviate	Milvus	Qdrant	pgvector
Type	Library	Embedded DB	Managed cloud	OSS + cloud	OSS + cloud	OSS + cloud	Postgres ext
Persistence	Manual	Yes	Yes	Yes	Yes	Yes	Yes
Filtering	Limited	Yes	Excellent	Excellent	Yes	Excellent	SQL-native
Hybrid search	No	Limited	Yes	Yes	Limited	Yes	Limited
GPU support	Yes	No	No	No	Yes	No	No
Scalability	10M	10M	Billions	Billions	Billions	Billions	10-50M
Best for	Prototypes	Small prod	Managed prod	Hybrid + schema	Massive scale	High perf	Already on Postgres

4. Visual Learning¶

Architecture — production vector DB¶

flowchart TD
    subgraph CLIENT[Client]
        APP[App server]
    end
    subgraph VDB[Vector DB]
        API[API gateway]
        QN[Query nodes]
        IN[Index nodes]
        SN[Storage nodes]
    end
    APP --> API
    API --> QN
    QN --> SN
    IN --> SN

Workflow — typical query¶

flowchart LR
    A[App receives query] --> B[Embed query]
    B --> C[Send vector + filter to VDB]
    C --> D[VDB pre-filters by metadata]
    D --> E[ANN search HNSW]
    E --> F[Top-k vectors]
    F --> G[Hydrate with text + metadata]
    G --> H[Return to app]

Sequence — full RAG with vector DB¶

sequenceDiagram
    actor U as User
    participant APP as App
    participant EMB as Embedder
    participant VDB as Vector DB
    participant LLM as LLM
    U->>APP: query
    APP->>EMB: embed query
    EMB-->>APP: vector
    APP->>VDB: search vector + filter
    VDB-->>APP: top-k chunks
    APP->>LLM: prompt with chunks
    LLM-->>APP: answer
    APP-->>U: answer with citations

Real-world example — multi-tenant SaaS¶

flowchart LR
    Q[User from Tenant A] --> APP[App server]
    APP --> EMB[Embed query]
    EMB --> VS[(Pinecone)]
    VS --> F[Filter: tenant_id = A]
    F --> ANN[ANN search]
    ANN --> R[Tenant A's top-k chunks]
    R --> LLM[LLM]
    LLM --> ANS[Answer for Tenant A]

A single vector store, partitioned by a tenant_id filter, can serve multiple customers safely, provided the filter is applied server-side on every query and never taken from user input.
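One way to make the tenant filter impossible to forget is to hand application code a wrapper that injects it. The sketch below is a hypothetical pattern, not a feature of any vector DB: `TenantScopedStore` and `RecordingStore` are made-up names, and the wrapped store is assumed to expose a `similarity_search(query, k, filter)` method like the examples earlier on this page.

```python
class TenantScopedStore:
    """Wraps a store so every search carries the caller's tenant_id filter."""

    def __init__(self, store, tenant_id):
        self._store = store
        self._tenant_id = tenant_id  # taken from the authenticated session

    def similarity_search(self, query, k=4, filter=None):
        scoped = dict(filter or {})
        scoped["tenant_id"] = self._tenant_id  # always overrides caller input
        return self._store.similarity_search(query, k=k, filter=scoped)


class RecordingStore:
    """Hypothetical stand-in for a real vector store; records the filter it gets."""

    def __init__(self):
        self.last_filter = None

    def similarity_search(self, query, k=4, filter=None):
        self.last_filter = filter
        return []


backend = RecordingStore()
store = TenantScopedStore(backend, tenant_id="A")

# Even a request that tries to name another tenant is forced back to tenant A.
store.similarity_search("refund policy", filter={"tenant_id": "B", "doc_type": "policy"})
print(backend.last_filter)  # {'tenant_id': 'A', 'doc_type': 'policy'}
```

The wrapper narrows the blast radius of a coding mistake; it does not replace access control in the database itself.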

5. Pros (vector DBs generally)¶

Sub-100ms search on millions of vectors.
Metadata filtering for multi-tenant and contextual queries.
Persistence + scaling as standard DB concerns.
Hybrid search (dense + sparse) in many.
Replication, HA in managed services.
Pluggable embedders — store vectors from any model.

6. Cons¶

Yet another database to operate (unless managed).
Often eventually consistent in distributed setups (a fresh upsert may not be searchable immediately).
Index rebuild required for some ANN tuning.
RAM hungry — HNSW indexes live in RAM for speed.
Cost at scale ($1K-$10K+/month for production).
Vendor lock-in for managed services.

7. Trade-offs¶

Choice	Trade-off
Managed (Pinecone) vs self-hosted	No ops vs cost + lock-in
HNSW vs IVF-PQ	Quality vs memory
Pre-filter vs post-filter	Speed vs flexibility
Single-node vs distributed	Simplicity vs scale
Schema-driven (Weaviate) vs schemaless (Chroma)	Validation vs flexibility

Decision rule:

Prototype → FAISS or Chroma.
Production, small team → Pinecone (managed) or Qdrant Cloud.
Already on Postgres + <10M vectors → pgvector.
Web-scale → Milvus or Pinecone enterprise.
Hybrid + schema requirements → Weaviate.

8. Real-world Industry Usage¶

OpenAI¶

ChatGPT memory uses an internal vector store.
Assistants File Search uses a managed Pinecone-like backend.

Anthropic¶

Doesn't publish vector store choice; speculation: a managed cloud solution.
Anthropic Tools API doesn't include native vector store — users bring their own.

Google¶

Vertex AI Vector Search is built on Google's ScaNN.
NotebookLM uses Google's proprietary infrastructure.

Enterprise¶

Notion AI uses Pinecone in production.
Quora's Poe reportedly uses Weaviate.
Shopify reportedly uses a custom in-house vector DB.
Microsoft integrates Azure AI Search (their managed search service with vector support) into Copilot.
Salesforce Einstein reportedly uses pgvector inside their existing Postgres infrastructure.

Production patterns¶

Pattern	Where
Pinecone serverless	Most startups; quick to set up
Qdrant self-hosted	Performance-focused teams
pgvector	Postgres-heavy shops
Multi-region replication	Global apps; cut query latency
Sharded by tenant	Multi-tenant SaaS

9. Interview Questions¶

Beginner¶

Why not just use Postgres for vectors? — Without an ANN index, every query is O(N). Vector DBs use HNSW etc. for O(log N).
What's HNSW? — Hierarchical Navigable Small World — a graph-based ANN index.
Pinecone vs Chroma — quick differences? — Pinecone: managed cloud, no ops, paid. Chroma: open-source, runs locally, free.

Intermediate¶

Pre-filter vs post-filter — which is faster? — Pre-filter (Pinecone, Weaviate, Qdrant). Post-filter (FAISS) can be 10× slower on selective filters.
How does HNSW achieve O(log N) search? — Hierarchical graph: top layers are sparse (fast jumps), bottom is dense (precise). Search descends layer by layer.
When pick Milvus over Pinecone? — When you need self-hosting + extreme scale (billions). Pinecone is easier; Milvus is more control.
What's "ef_search" in HNSW? — Number of candidates explored at query time. Higher → better recall, slower. Production tuning knob.

Advanced¶

Index types beyond HNSW — when to use IVF-PQ? — When RAM is tight at massive scale. PQ compresses vectors lossy; IVF coarsely partitions. Sacrifices ~5% recall for 5-10× memory reduction.
How do you handle index updates at scale? — Batch upserts (1000 at a time); blue-green index swapping (build new in background, atomic switch); tombstones for deletes.
Pinecone serverless vs pod-based — what's the difference? — Serverless: usage-based pricing, eventually consistent reads. Pod-based: dedicated capacity, strongly consistent. Trade cost for predictability.
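The batching and blue-green ideas from the index-update answer above can be sketched independently of any particular database. Everything here is hypothetical scaffolding: `upsert_fn` stands for whatever bulk-write call your client exposes, and `IndexAlias` stands for whatever indirection (alias, config flag, routing table) your readers resolve at query time.

```python
from itertools import islice


def batched(items, size=1000):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def bulk_upsert(records, upsert_fn, size=1000):
    """Send records in batches; returns how many records were sent."""
    sent = 0
    for chunk in batched(records, size):
        upsert_fn(chunk)
        sent += len(chunk)
    return sent


class IndexAlias:
    """Hypothetical alias that readers resolve on every query."""

    def __init__(self, live_index_name):
        self.live = live_index_name

    def swap(self, new_index_name):
        """Point readers at the rebuilt index; return the old name for cleanup."""
        old, self.live = self.live, new_index_name
        return old


batch_sizes = []
total = bulk_upsert(range(2500), lambda chunk: batch_sizes.append(len(chunk)))
print(total, batch_sizes)  # 2500 [1000, 1000, 500]

alias = IndexAlias("chunks_v1")
retired = alias.swap("chunks_v2")  # only after chunks_v2 is fully built
print(alias.live, retired)  # chunks_v2 chunks_v1
```

Keeping the retired index around for a while gives you a cheap rollback if the new one turns out to be worse.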

System design¶

Design a vector DB for 1B chunks across 50 customers, each isolated. — Either: shard by tenant (separate indexes); OR shared index with tenant_id filter + strict ACL middleware. Multi-region replication for latency. Pre-filter for performance.
Your team's RAG p99 latency degraded from 200ms → 2s. Diagnose. — Possibilities: (1) index grew past RAM, paging from disk; (2) HNSW ef_search too high; (3) filter selectivity dropped; (4) network hop to managed VDB. Run with profiling on; check VDB metrics.

10. Common Mistakes¶

Beginners¶

❌ Using FAISS in production without persistence backup.
❌ Not setting metadata at insertion → can't filter later.
❌ Mixing embedders → vectors live in different spaces, junk results.
❌ allow_dangerous_deserialization=True on untrusted FAISS indexes (RCE risk).

Production teams¶

❌ Hard-coding the vector DB in business logic → migration nightmare.
❌ No index versioning → can't safely rebuild.
❌ Skipping metric monitoring → silent quality degradation as data grows.
❌ Over-fetching with k=50 when k=5 is enough → wastes LLM tokens + latency.

How to avoid¶

Abstract vector DB behind an interface in your app.
Tag indexes with (embedder, splitter_config, version).
Run RAGAS weekly to catch quality drift.
Profile filter selectivity — bad filters = slow queries.
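The first two items can be sketched together: a small interface that business logic depends on, plus a tag that names each index by what produced it. This is one possible shape under stated assumptions, not a standard API; `VectorIndex`, `InMemoryIndex`, `IndexTag`, and the example tag values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class IndexTag:
    """Identifies which embedder, splitter settings, and version built an index."""

    embedder: str
    splitter_config: str
    version: int

    def index_name(self) -> str:
        return f"{self.embedder}__{self.splitter_config}__v{self.version}"


class VectorIndex(Protocol):
    """The only surface the rest of the app is allowed to call."""

    def upsert(self, doc_id: str, vector: list, metadata: dict) -> None: ...

    def search(self, vector: list, k: int) -> list: ...


class InMemoryIndex:
    """Brute-force implementation, handy for unit tests and local dev."""

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, metadata):
        self._rows[doc_id] = (vector, metadata)

    def search(self, vector, k):
        def dot(doc_id):
            stored = self._rows[doc_id][0]
            return sum(a * b for a, b in zip(vector, stored))

        return sorted(self._rows, key=dot, reverse=True)[:k]


def retrieve_ids(index: VectorIndex, query_vector, k=2):
    """Business logic sees only the interface, never a vendor client."""
    return index.search(query_vector, k)


tag = IndexTag(embedder="my-embedder", splitter_config="chunk512-overlap64", version=3)
print(tag.index_name())  # my-embedder__chunk512-overlap64__v3

index = InMemoryIndex()
index.upsert("a", [1.0, 0.0], {"source": "faq"})
index.upsert("b", [0.0, 1.0], {"source": "blog"})
print(retrieve_ids(index, [0.9, 0.1], k=1))  # ['a']
```

Swapping vendors then means writing one new class that satisfies `VectorIndex`, and a changed embedder or splitter shows up as a new index name instead of silently mixing vector spaces.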

11. Best Practices¶

Industry standards¶

Use HNSW as default index. Adjust ef_construction / ef_search per workload.
Cosine similarity for text embeddings (or dot product on pre-normalized vectors, which gives identical scores).
Metadata schema with (source, doc_id, chunk_idx, ingested_at, tenant_id) always.
Version your indexes for safe rebuilds.
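The equivalence between cosine similarity and dot product on normalized vectors is quick to check numerically. The vectors below are arbitrary illustrative values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
same = math.isclose(cosine(a, b), dot(normalize(a), normalize(b)))
print(same)  # True
```

Normalizing once at ingestion lets the index use the cheaper dot product at query time without changing the ranking.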

Production¶

Managed services if you're a small team. Avoid operating yet another database.
Per-tenant filters for multi-tenant. Audit ACL carefully.
Async upsert during indexing.
Health checks for production VDB clusters.

Optimization¶

Quantization: binary quantization (e.g., in Qdrant) stores 1 bit per dimension instead of a 32-bit float, a 32× storage reduction; scalar (int8) quantization gives 4×.
Sharding by tenant or topic for very large corpora.
Caching common queries.
Co-locate app and VDB in the same region — network is your latency floor.

12. Evolution Story¶

flowchart LR
    A[SQL with vector UDF<br/>O N scan, unusable] --> B[FAISS library<br/>in-memory ANN]
    B --> C[Chroma<br/>persistent, easy]
    C --> D[Pinecone<br/>managed cloud]
    D --> E[Weaviate / Qdrant / Milvus<br/>OSS production options]
    E --> F[pgvector<br/>vectors meet existing DBs]
    F --> G[Binary quantization<br/>32x storage cut]
    G --> H[Disk-based indexes DiskANN<br/>massive scale on SSD]

Where we are: Vector DBs are a stable category. The choice is operational (managed vs self-hosted, cloud vs on-prem), not capability.

Where we're going (next chapter): Vector DBs return raw similarity results. But raw similarity isn't always what we want. We need a Retriever abstraction that wraps the vector store and supports search variations — pure similarity, MMR, threshold, AND we'll learn about hybrid retrieval (dense + sparse / BM25) to fix vector search's blind spot for exact terms.

Practice¶

What does this print?

Expected: 4

default_k = 4
print(default_k)

Use HNSW (not flat / brute force) for million-vector indexes

Expected: True

index_type = "flat"             # bug to fix: flat means an O(N) scan at query time
is_fast = index_type in ("hnsw", "ivf", "ivf_pq")
print(is_fast)                  # prints True once index_type is an ANN index

Quiz — Quick check¶

What you remember

Q1. What does HNSW give us?

O(log N) approximate nearest neighbor search
Exact search
Storage compression
Embedding generation

Q2. Why is pre-filtering (Pinecone, Weaviate) faster than post-filtering (FAISS)?

Indexed filters narrow the search BEFORE ANN; post-filter scans then filters
Pinecone is just faster overall
Post-filter requires more LLM calls
No real difference

Q3. Which vector DB is best if you're already running Postgres for <10M vectors?

Pinecone
Milvus
pgvector
FAISS

Common doubts¶

Should I always use HNSW?

Default to HNSW. Switch to IVF or IVF-PQ when RAM is tight and you can tolerate slight recall loss. Most managed services default to HNSW for you.

Can I mix vector DBs in one app?

Yes — but rare. Common pattern: Pinecone for prod, FAISS for unit tests / local dev. Abstract behind a Retriever interface.

How big should k be at retrieval time?

Start with k=4 for chat-style RAG. Bigger k = more context, more cost, "lost in the middle." Use a reranker (Chapter 7) to retrieve big (k=20) and refine to small (k=4).

→ Retrievers — Dense, Sparse, Hybrid