Vector Stores & Embeddings

1. Why this matters

RAG needs to find "the 4 chunks most relevant to this question" out of millions. A brute-force scan compares the query against every stored vector, so its cost grows linearly with the size of the corpus. Vector stores index embeddings using approximate nearest-neighbor (ANN) methods (HNSW, IVF, ScaNN), which trade a small amount of recall for query latencies that typically stay well under 100 ms, even on very large collections.
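To make the brute-force baseline concrete, here is a minimal pure-Python sketch of exact top-k search by cosine similarity. The helper names `cosine` and `top_k_exact` and the toy 3-dimensional vectors are illustrative assumptions, not part of LangChain. An ANN index exists to avoid exactly this scan over every stored vector.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_exact(query, vectors, k):
    """Score every stored vector against the query: O(n * d) work per query."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)

# Toy "corpus" of hand-made vectors (hypothetical data)
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
best = top_k_exact([1.0, 0.0, 0.1], corpus, k=2)
best_ids = [i for _, i in best]
print(best_ids)  # indices of the two closest vectors
```

Every query touches all `n` vectors, which is fine for thousands of chunks and increasingly painful beyond that.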

LangChain's VectorStore interface abstracts over a dozen backends, so you can prototype on FAISS and ship on Pinecone with few changes to your application code. Filter syntax and some features still differ per backend.

2. Mental model

flowchart LR
    subgraph Indexing
      D[Documents/chunks] --> E1[Embedding Model]
      E1 --> V1[Vectors]
      V1 --> VS[Vector Store<br/>FAISS/Chroma/Pinecone]
    end
    subgraph Querying
      Q[Query string] --> E2[Same Embedding Model]
      E2 --> V2[Query Vector]
      V2 --> VS
      VS --> R[Top-k similar<br/>chunks]
    end

Rule: the embedding model used to index MUST be the same model used to query. Different models → different vector spaces → comparison is meaningless.
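One way to enforce this rule in practice is to store a small manifest next to the index and check it before querying. The sketch below is an assumption about how you might do that: `write_manifest`, `check_manifest`, and the `manifest.json` file name are hypothetical helpers, not LangChain features.

```python
import json
import tempfile
from pathlib import Path

def write_manifest(index_dir, model_name, dim):
    """Record which embedding model built the index (hypothetical helper)."""
    Path(index_dir).mkdir(parents=True, exist_ok=True)
    manifest = {"embedding_model": model_name, "dim": dim}
    (Path(index_dir) / "manifest.json").write_text(json.dumps(manifest))

def check_manifest(index_dir, model_name, dim):
    """Raise if the query-time model differs from the one used for indexing."""
    manifest = json.loads((Path(index_dir) / "manifest.json").read_text())
    if manifest != {"embedding_model": model_name, "dim": dim}:
        raise ValueError(
            f"index built with {manifest}, but querying with "
            f"{model_name!r} ({dim} dims): rebuild the index"
        )

index_dir = tempfile.mkdtemp()
write_manifest(index_dir, "text-embedding-3-small", 1536)
check_manifest(index_dir, "text-embedding-3-small", 1536)  # same model: passes

try:
    check_manifest(index_dir, "text-embedding-3-large", 3072)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

Failing loudly at load time is cheaper than debugging retrieval that silently returns irrelevant chunks.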

3. Architecture / Flow

flowchart TD
    A[Chunks list of Document] --> B[OpenAIEmbeddings or any other]
    B --> C[1536-dim vectors, one per chunk]
    C --> D[Vector Store]
    Q[Question string] --> B
    B -->|query vector| D
    D -->|cosine similarity| E[Top-k Documents]
    E --> F[Pass to LLM prompt as context]

4. Core concepts

Embedding — a fixed-length list of floats (1536 for text-embedding-3-small) that represents text in a semantic space. Similar meanings → close vectors.
Similarity metric — usually cosine similarity (1 = identical direction, 0 = orthogonal, -1 = opposite). Sometimes dot product or L2.
Index type — flat (exact, slow) vs HNSW / IVF (approximate, fast). Quality vs speed trade-off.
Dimensions — 1536 for OpenAI text-embedding-3-small, 3072 for text-embedding-3-large. Bigger = better quality but more storage/cost.
Metadata filtering — most stores support filtering retrieval by metadata (source = "report.pdf", year >= 2024).
Persistence — FAISS by default is in-memory (saved with .save_local); Chroma can persist to disk; cloud stores are persistent by definition.
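The three similarity metrics listed above are closely related once vectors are scaled to unit length: the dot product equals cosine similarity, and squared L2 distance equals 2 - 2·cos, so all three produce the same ranking. The following pure-Python sketch checks this on two hand-made vectors; the helper names and the toy data are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

# Hypothetical toy vectors, scaled to unit length
q = normalize([1.0, 2.0, 2.0])
v = normalize([2.0, 1.0, 2.0])

cos_qv = cosine(q, v)
dot_matches_cosine = math.isclose(dot(q, v), cos_qv)
l2_matches_cosine = math.isclose(l2(q, v) ** 2, 2 - 2 * cos_qv)
print(dot_matches_cosine, l2_matches_cosine)
```

For vectors that are not unit length the metrics can disagree, which is why the metric chosen at index creation should match what the embedding model expects.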

5. Code — minimal working example

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="Python is great for data science.", metadata={"id": 1}),
    Document(page_content="JavaScript is the language of the web.", metadata={"id": 2}),
    Document(page_content="Rust is fast and memory-safe.", metadata={"id": 3}),
]

# Build the index (embeds + stores in one call)
vs = FAISS.from_documents(docs, embeddings)

# Query
hits = vs.similarity_search("Which language is good for AI?", k=2)
for d in hits:
    print(d.metadata["id"], "→", d.page_content)

Persist + reload (FAISS):

vs.save_local("./index")

vs2 = FAISS.load_local(
    "./index",
    embeddings,
    allow_dangerous_deserialization=True,   # only if you trust the file
)

6. Code — real-world pattern

Chroma with persistent storage + metadata filtering:

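from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# `chunks` is a list of Document objects, e.g. the output of a text splitter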
vs = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="acme_docs",
    persist_directory="./chroma_db",
)

# Filtered search — only inside the 2024 annual report
hits = vs.similarity_search(
    query="What was Q4 revenue?",
    k=4,
    filter={"$and": [
        {"doc_type": "annual_report"},
        {"year": 2024},
    ]},
)
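Conceptually, a filtered search like the one above restricts the candidates by metadata and ranks the survivors by similarity. The sketch below shows that idea in plain Python with equality-only filters; `matches`, `filtered_search`, and the toy records are hypothetical, and real backends differ in whether they filter before or after the vector search.

```python
def matches(metadata, where):
    """True if every key in `where` equals the chunk's metadata value."""
    return all(metadata.get(key) == value for key, value in where.items())

def filtered_search(query_vec, records, where, k):
    """Pre-filter by metadata, then rank the survivors by dot product."""
    survivors = [r for r in records if matches(r["metadata"], where)]
    survivors.sort(
        key=lambda r: sum(q * x for q, x in zip(query_vec, r["vector"])),
        reverse=True,
    )
    return survivors[:k]

# Hypothetical records: a vector plus the metadata stored with each chunk
records = [
    {"id": "a", "vector": [0.9, 0.1], "metadata": {"doc_type": "annual_report", "year": 2024}},
    {"id": "b", "vector": [1.0, 0.0], "metadata": {"doc_type": "annual_report", "year": 2023}},
    {"id": "c", "vector": [0.2, 0.8], "metadata": {"doc_type": "annual_report", "year": 2024}},
    {"id": "d", "vector": [0.95, 0.05], "metadata": {"doc_type": "memo", "year": 2024}},
]
hits = filtered_search(
    [1.0, 0.0], records, {"doc_type": "annual_report", "year": 2024}, k=2
)
hit_ids = [r["id"] for r in hits]
print(hit_ids)
```

Records "b" and "d" score highest on similarity alone but never reach the ranking step, because the filter removes them first.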

Pinecone (cloud, production):

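from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Assumes the "acme-prod" index already exists and PINECONE_API_KEY is set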
vs = PineconeVectorStore(
    index_name="acme-prod",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    namespace="tenant_42",          # multi-tenancy
)
vs.add_documents(chunks)
hits = vs.similarity_search("...", k=4)

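Get scores too. What the score means depends on the backend: some return a similarity (higher is better), others a distance (lower is better; LangChain's FAISS wrapper defaults to L2 distance):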

hits_with_scores = vs.similarity_search_with_score("...", k=4)
for doc, score in hits_with_scores:
    print(f"{score:.3f}  {doc.page_content[:80]}")

Use it as a Runnable retriever (slots straight into LCEL chains):

retriever = vs.as_retriever(
    search_type="similarity",   # or "mmr"
    search_kwargs={"k": 4, "filter": {"year": 2024}},
)
# retriever is a Runnable: chain = retriever | format_docs | prompt | model

7. Common pitfalls

❗ Mismatched embedding models between indexing and querying. Re-build the index when you change embedding models.
❗ allow_dangerous_deserialization=True on untrusted FAISS files. LangChain's FAISS wrapper saves its docstore with pickle, and unpickling can execute arbitrary code. Only load files you produced yourself.
❗ Forgetting metadata. Without it, you can't filter, can't cite, can't debug. Always preserve source at minimum.
❗ Not persisting Chroma. Without persist_directory, Chroma is in-memory and you re-index on every restart.
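❗ Re-embedding the entire corpus on small updates. Use LangChain's indexing API (langchain.indexes.index with a record manager) or your store's upsert with stable IDs, so only new or changed chunks are embedded.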
❗ Picking the wrong embedding model dimensions. 3072-dim is better quality but ~2× the storage and slightly slower search. text-embedding-3-small (1536) is the sweet spot for most apps.
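The "stable IDs" advice above can be sketched with content hashes: if a chunk's ID is derived from its source and text, unchanged chunks keep their ID across runs and only edited chunks need re-embedding. `stable_id` and `plan_update` are hypothetical helpers, and the hashing scheme is one possible choice among many.

```python
import hashlib

def stable_id(source, text):
    """Deterministic ID from source + content, so unchanged chunks keep their ID."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()[:16]

def plan_update(indexed_ids, chunks):
    """Return (chunks to add, IDs to delete) given what the store already holds."""
    current = {stable_id(src, text): (src, text) for src, text in chunks}
    to_add = {i: c for i, c in current.items() if i not in indexed_ids}
    to_delete = set(indexed_ids) - set(current)
    return to_add, to_delete

# Hypothetical corpus as (source, text) pairs, before and after one edit
v1 = [("a.md", "alpha"), ("a.md", "beta"), ("b.md", "gamma")]
v2 = [("a.md", "alpha"), ("a.md", "beta, revised"), ("b.md", "gamma")]

indexed = {stable_id(s, t) for s, t in v1}
to_add, to_delete = plan_update(indexed, v2)
print(len(to_add), len(to_delete))  # only the edited chunk changes
```

With a LangChain store, the resulting IDs would typically be passed as `ids=` when adding documents and to `delete(ids=...)` when removing stale ones; check your backend's documentation for the exact upsert behavior.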

8. When to use vs not use

Vector store	Use when
FAISS (in-memory)	Local prototyping, < 1M vectors, no persistence needed
Chroma	Local + persistent, simple SDK, < 10M vectors
Pinecone	Managed cloud, instant scale, low ops overhead
Weaviate	Open-source + cloud, strong hybrid search (vector + keyword + filter)
Qdrant	Open-source + cloud, great metadata filtering, written in Rust
pgvector (Postgres)	Already running Postgres, want one DB to manage
Milvus	Massive scale (billions of vectors), self-hosted

Use a vector store at all when	Skip it when
You need semantic search over > a few hundred docs	< 50 docs — just stuff them in the prompt
You need filtering + similarity together	Pure exact-match — use a regular DB

9. Cheatsheet

# Embeddings
from langchain_openai import OpenAIEmbeddings
emb = OpenAIEmbeddings(model="text-embedding-3-small")   # 1536 dims
emb.embed_query("hello")            # list[float]
emb.embed_documents(["a", "b"])     # list[list[float]]

# FAISS
from langchain_community.vectorstores import FAISS
vs = FAISS.from_documents(docs, emb)
vs.save_local("./idx")
vs = FAISS.load_local("./idx", emb, allow_dangerous_deserialization=True)

# Chroma
from langchain_chroma import Chroma
vs = Chroma.from_documents(docs, emb, persist_directory="./chroma")

# Pinecone
from langchain_pinecone import PineconeVectorStore
vs = PineconeVectorStore(index_name="...", embedding=emb)

# Add / update / delete
vs.add_documents(new_docs)
vs.delete(ids=["doc-1", "doc-2"])
vs.add_texts(["plain string"], metadatas=[{"src": "x"}])

# Query
vs.similarity_search("query", k=4)
vs.similarity_search_with_score("query", k=4)
vs.max_marginal_relevance_search("query", k=4, fetch_k=20, lambda_mult=0.5)

# Filters (syntax varies per backend)
vs.similarity_search("...", filter={"year": 2024})

# As retriever (Runnable)
retriever = vs.as_retriever(search_type="mmr", search_kwargs={"k": 4})

10. Q&A — recall test

Q: Why must the indexing embedding model match the query embedding model? A: Each model has its own vector space — the same text would be at different coordinates in each. Cosine similarity between vectors from different models is meaningless.
Q: When would you pick MMR over straight similarity? A: When top-k similarity returns duplicates or near-duplicates. MMR (Max Marginal Relevance) trades a bit of relevance for diversity — better when the corpus has many similar chunks.
Q: FAISS vs Chroma for a side project? A: Chroma. It's simple, persists to disk once you pass persist_directory, and has decent metadata filtering. FAISS is faster for raw similarity but is in-memory unless you handle save/load manually.
Q: How do you scope retrieval to a specific tenant in multi-tenant apps? A: Either put tenants in different collections/namespaces (Pinecone namespace, Chroma collection) OR add a tenant_id to metadata and always filter on it.
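Q: Cost of switching embedding models from -small to -large? A: ~2× storage (3072 vs 1536 dims), a several-times higher per-token embedding price (roughly 6.5× at OpenAI's launch pricing; check current rates), slightly slower search, and a full re-index, since vectors from the two models are not comparable. Quality gain is moderate. Most apps don't need -large.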
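The relevance-versus-diversity trade-off behind MMR, mentioned in the Q&A above, can be sketched as a greedy loop: each step picks the candidate with the best `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. This is a simplified pure-Python illustration on hypothetical 2-dimensional vectors, not LangChain's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: pick k candidate indices balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],    # 0: most relevant
    [1.0, 0.12],   # 1: near-duplicate of 0
    [1.0, -0.5],   # 2: less relevant, but points in a different direction
]
pure_similarity = mmr(query, candidates, k=2, lambda_mult=1.0)
balanced = mmr(query, candidates, k=2, lambda_mult=0.5)
print(pure_similarity, balanced)
```

With `lambda_mult=1.0` the loop degenerates to plain top-k similarity and returns the near-duplicate pair; at `0.5` the second pick switches to the more distinct candidate.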

Practice

What does this print?

Expected: True

# Cosine similarity between identical vectors is 1.0
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])
cos_sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 2) == 1.0)

Compute cosine similarity correctly (divide by both norms, not just one)

Expected: True

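import numpy as np
a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])   # non-unit vector, so dividing by |b| matters
# Buggy version: (a @ b) / np.linalg.norm(a) gives 1.2 here, outside [-1, 1]
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))   # divide by both norms
print(0 < cos < 1)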

Quiz — Quick check

What you remember

Q1. What does an embedding model do?

Maps text to a fixed-length vector where semantically similar texts have similar vectors
Generates text
Counts tokens
Translates languages

Why: Embeddings are the foundation of semantic search. "I love dogs" and "Canines are my favorite" should have similar vectors despite no shared words.

Q2. Which similarity metric is most commonly used with text embeddings?

Euclidean distance
Cosine similarity — measures angle, ignoring magnitude
Manhattan distance
Hamming distance

Why: Cosine similarity normalizes for vector length, focusing on direction. Two embeddings can have very different magnitudes but encode the same meaning — cosine catches that.

Q3. When should you choose text-embedding-3-large over text-embedding-3-small?

Always
Only when retrieval quality is the bottleneck and you can afford 2× storage and a higher embedding cost
For non-English
Never

Why: -small is the cost/quality sweet spot for most apps. Switch to -large only if you've measured -small falling short. The quality gain is moderate; the cost hit is significant.

Common doubts

Which vector store should I use?

For development/learning: FAISS (local, fast, free). For production at small scale: Chroma (simple, persistent). For production at scale: pgvector (if you already have Postgres), Pinecone or Weaviate (managed services). Don't optimize this choice early — they're easy to swap later.

How big is an embedding?

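Most OpenAI and sentence-transformers embeddings have 384 to 1536 dimensions (3072 for text-embedding-3-large) stored as float32, which is ~1.5-6 KB per chunk. For 1M chunks at 1536 dims, that's ~6 GB of raw vectors, before index and metadata overhead. Plan storage accordingly.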
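The storage figures above follow from simple arithmetic: vectors × dimensions × 4 bytes per float32. A small sketch, with `raw_vector_bytes` as a hypothetical helper that ignores index structures and metadata:

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors, excluding index and metadata overhead."""
    return num_vectors * dims * bytes_per_value

per_chunk_384 = raw_vector_bytes(1, 384)         # one 384-dim vector
per_chunk_1536 = raw_vector_bytes(1, 1536)       # one 1536-dim vector
corpus_bytes = raw_vector_bytes(1_000_000, 1536) # 1M chunks at 1536 dims

print(per_chunk_384, per_chunk_1536)
print(f"{corpus_bytes / 1e9:.2f} GB")
```

Actual disk or memory use is usually higher, because ANN indexes such as HNSW store graph links in addition to the vectors themselves.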

Can I embed images or audio?

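Yes. CLIP embeds images into a joint text+image space. For audio, either use an audio embedding model or transcribe with a speech-to-text model such as Whisper and embed the transcript. For unified search across modalities, use multimodal embeddings. LangChain has wrappers for several of these; searching for "MultiVectorRetriever" turns up its multimodal retrieval examples.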