Vector Stores & Embeddings
1. Why this matters
RAG needs to find "the 4 chunks most relevant to this question" out of millions. A linear scan that compares the query against every chunk gets too slow at that scale. Vector stores index embeddings with approximate nearest-neighbor (ANN) methods (HNSW, IVF, ScaNN), which can keep search under 100 ms even across very large collections, at the cost of occasionally missing a true nearest neighbor.
LangChain's VectorStore interface abstracts over a dozen backends so you can prototype on FAISS and ship on Pinecone without rewriting your app.
2. Mental model
flowchart LR
subgraph Indexing
D[Documents/chunks] --> E1[Embedding Model]
E1 --> V1[Vectors]
V1 --> VS[Vector Store<br/>FAISS/Chroma/Pinecone]
end
subgraph Querying
Q[Query string] --> E2[Same Embedding Model]
E2 --> V2[Query Vector]
V2 --> VS
VS --> R[Top-k similar<br/>chunks]
end
Rule: the embedding model used to index MUST be the same model used to query. Different models → different vector spaces → comparison is meaningless.
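One way to enforce this rule is to record which model built the index and check it before every query. The sketch below is only an illustration: `IndexManifest` and `is_compatible` are hypothetical names, not part of LangChain or any vector store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Hypothetical record saved next to an index when it is built."""
    embedding_model: str
    dimensions: int


def is_compatible(manifest: IndexManifest, query_model: str, query_vector: list) -> bool:
    """True only if the query used the same model (and therefore the same dimensions)."""
    # Matching dimensions alone is not enough: two different models can share a size.
    return (
        query_model == manifest.embedding_model
        and len(query_vector) == manifest.dimensions
    )


manifest = IndexManifest(embedding_model="text-embedding-3-small", dimensions=1536)
print(is_compatible(manifest, "text-embedding-3-small", [0.0] * 1536))
print(is_compatible(manifest, "text-embedding-3-large", [0.0] * 3072))
```

Checking the model name first matters because a dimension check by itself would accept vectors from an unrelated model that happens to use the same size.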
3. Architecture / Flow
flowchart TD
A[Chunks list of Document] --> B[OpenAIEmbeddings or any other]
B --> C[1536-dim vectors, one per chunk]
C --> D[Vector Store]
Q[Question string] --> B
B -->|query vector| D
D -->|cosine similarity| E[Top-k Documents]
E --> F[Pass to LLM prompt as context]
4. Core concepts
- Embedding: a fixed-length list of floats (1536 for `text-embedding-3-small`) that represents text in a semantic space. Similar meanings → close vectors.
- Similarity metric: usually cosine similarity (1 = identical direction, 0 = orthogonal, -1 = opposite). Sometimes dot product or L2 (Euclidean) distance.
- Index type: flat (exact, slow) vs HNSW / IVF (approximate, fast). Quality vs speed trade-off.
- Dimensions: 1536 for OpenAI `text-embedding-3-small`, 3072 for `text-embedding-3-large`. Bigger = better quality but more storage/cost.
- Metadata filtering: most stores support filtering retrieval by metadata (`source = "report.pdf"`, `year >= 2024`).
- Persistence: FAISS is in-memory by default (save it explicitly with `.save_local`); Chroma can persist to disk; cloud stores are persistent by definition.
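To make the "flat (exact, slow)" index type concrete, here is a minimal pure-Python sketch of exact top-k search. The 3-dimensional vectors are invented for illustration, and `normalize` and `flat_search` are hypothetical helper names. Vectors are normalized first, so the dot product equals cosine similarity.

```python
import math


def normalize(v: list) -> list:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def flat_search(query: list, vectors: list, k: int) -> list:
    """Exact top-k: score every stored vector, then sort (cost grows with corpus size)."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (made up; real ones have hundreds or thousands of dimensions).
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
top = flat_search([1.0, 0.05, 0.0], vectors, k=2)
print([i for i, _ in top])  # [0, 1]
```

Every query touches every vector, which is why approximate indexes such as HNSW or IVF become attractive as the collection grows.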
5. Code: minimal working example
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
docs = [
    Document(page_content="Python is great for data science.", metadata={"id": 1}),
    Document(page_content="JavaScript is the language of the web.", metadata={"id": 2}),
    Document(page_content="Rust is fast and memory-safe.", metadata={"id": 3}),
]
# Build the index (embeds + stores in one call)
vs = FAISS.from_documents(docs, embeddings)
# Query
hits = vs.similarity_search("Which language is good for AI?", k=2)
for d in hits:
    print(d.metadata["id"], "→", d.page_content)
Persist + reload (FAISS):
vs.save_local("./index")
vs2 = FAISS.load_local(
    "./index",
    embeddings,
    allow_dangerous_deserialization=True,  # only if you trust the file
)
6. Code: real-world pattern
Chroma with persistent storage + metadata filtering:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# chunks: a list of Document objects, e.g. the output of your text splitter
vs = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="acme_docs",
    persist_directory="./chroma_db",
)
# Filtered search: only inside the 2024 annual report
hits = vs.similarity_search(
    query="What was Q4 revenue?",
    k=4,
    filter={"$and": [
        {"doc_type": "annual_report"},
        {"year": 2024},
    ]},
)
Pinecone (cloud, production):
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
# Assumes PINECONE_API_KEY is set and the "acme-prod" index already exists
vs = PineconeVectorStore(
    index_name="acme-prod",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    namespace="tenant_42",  # multi-tenancy
)
vs.add_documents(chunks)
hits = vs.similarity_search("...", k=4)
Get similarity scores too:
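# Score semantics vary by backend: some return a distance (lower is closer),
# others a similarity (higher is closer). Check your store's docs before thresholding.
hits_with_scores = vs.similarity_search_with_score("...", k=4)
for doc, score in hits_with_scores:
    print(f"{score:.3f} {doc.page_content[:80]}")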
Use it as a Runnable retriever (slots straight into LCEL chains):
retriever = vs.as_retriever(
    search_type="similarity",  # or "mmr"
    search_kwargs={"k": 4, "filter": {"year": 2024}},
)
# retriever is a Runnable: chain = retriever | format_docs | prompt | model
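The chain comment above uses `format_docs` without defining it. A minimal sketch of such a helper follows, assuming its job is to join retrieved chunks into a single context string. The `Doc` dataclass is a stand-in that mimics the `page_content` and `metadata` attributes of LangChain's `Document` so the example runs without any dependencies; the sample text is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two attributes as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join retrieved chunks into one context string, tagging each with its source."""
    return "\n\n".join(
        f"[{d.metadata.get('source', 'unknown')}] {d.page_content}" for d in docs
    )


# Placeholder chunks, not real data.
context = format_docs([
    Doc("Q4 revenue was flat.", {"source": "report.pdf"}),
    Doc("Headcount grew in 2024."),
])
print(context)
```

Prefixing each chunk with its `source` keeps citations possible in the final answer, which is one reason to preserve metadata through the pipeline.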
7. Common pitfalls
- ❗ Mismatched embedding models between indexing and querying. Rebuild the index whenever you change embedding models.
- ❗ `allow_dangerous_deserialization=True` on untrusted FAISS files. LangChain's FAISS wrapper saves its docstore with pickle, and unpickling can execute arbitrary code, so only load files you produced yourself.
- ❗ Forgetting metadata. Without it, you can't filter, can't cite, can't debug. Always preserve `source` at minimum.
- ❗ Not persisting Chroma. Without `persist_directory`, Chroma is in-memory and you re-index on every restart.
- ❗ Re-embedding the entire corpus on small updates. Use LangChain's indexing API or your store's upsert with stable IDs.
- ❗ Picking the wrong embedding model dimensions. 3072-dim is better quality but ~2× the storage and slightly slower search. `text-embedding-3-small` (1536) is the sweet spot for most apps.
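The "stable IDs" advice can be sketched as follows: derive each chunk's ID from where it came from, and keep a hash of its text so unchanged chunks are skipped on the next run. All function names here are hypothetical, and the ID scheme (source plus chunk position) is just one possible choice.

```python
import hashlib


def stable_chunk_id(source: str, chunk_index: int) -> str:
    """Same source + position always yields the same ID across runs."""
    return hashlib.sha256(f"{source}:{chunk_index}".encode("utf-8")).hexdigest()[:16]


def content_hash(text: str) -> str:
    """Fingerprint of the chunk text; if it is unchanged, skip re-embedding."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_embed(chunks: dict, seen: dict) -> list:
    """Return IDs whose text is new or changed compared with the previous run."""
    return [cid for cid, text in chunks.items() if seen.get(cid) != content_hash(text)]


# Previous run: two chunks from a placeholder file.
old = {
    stable_chunk_id("report.pdf", 0): "Intro",
    stable_chunk_id("report.pdf", 1): "Revenue",
}
seen = {cid: content_hash(text) for cid, text in old.items()}

# Current run: only the second chunk changed.
new = dict(old)
new[stable_chunk_id("report.pdf", 1)] = "Revenue (restated)"
changed = chunks_to_embed(new, seen)
print(len(changed))  # 1
```

The resulting IDs can then be passed as `ids=` when adding documents; whether re-adding an existing ID overwrites, errors, or duplicates depends on the backend, so check your store's behavior.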
8. When to use vs not use
| Vector store | Use when |
|---|---|
| FAISS (in-memory) | Local prototyping, < 1M vectors, no persistence needed |
| Chroma | Local + persistent, simple SDK, < 10M vectors |
| Pinecone | Managed cloud, instant scale, low ops overhead |
| Weaviate | Open-source + cloud, strong hybrid search (vector + keyword + filter) |
| Qdrant | Open-source + cloud, great metadata filtering, written in Rust |
| pgvector (Postgres) | Already running Postgres, want one DB to manage |
| Milvus | Massive scale (billions of vectors), self-hosted |

| Use a vector store at all when | Skip it when |
|---|---|
| You need semantic search over more than a few hundred docs | Fewer than 50 docs: just put them in the prompt |
| You need filtering + similarity together | Pure exact-match: use a regular DB |
9. Cheatsheet
# Embeddings
from langchain_openai import OpenAIEmbeddings
emb = OpenAIEmbeddings(model="text-embedding-3-small") # 1536 dims
emb.embed_query("hello") # list[float]
emb.embed_documents(["a", "b"]) # list[list[float]]
# FAISS
from langchain_community.vectorstores import FAISS
vs = FAISS.from_documents(docs, emb)
vs.save_local("./idx")
vs = FAISS.load_local("./idx", emb, allow_dangerous_deserialization=True)
# Chroma
from langchain_chroma import Chroma
vs = Chroma.from_documents(docs, emb, persist_directory="./chroma")
# Pinecone
from langchain_pinecone import PineconeVectorStore
vs = PineconeVectorStore(index_name="...", embedding=emb)
# Add / update / delete
vs.add_documents(new_docs)
vs.delete(ids=["doc-1", "doc-2"])
vs.add_texts(["plain string"], metadatas=[{"src": "x"}])
# Query
vs.similarity_search("query", k=4)
vs.similarity_search_with_score("query", k=4)
vs.max_marginal_relevance_search("query", k=4, fetch_k=20, lambda_mult=0.5)
# Filters (syntax varies per backend)
vs.similarity_search("...", filter={"year": 2024})
# As retriever (Runnable)
retriever = vs.as_retriever(search_type="mmr", search_kwargs={"k": 4})
10. Q&A: recall test
- Q: Why must the indexing embedding model match the query embedding model? A: Each model has its own vector space, so the same text lands at different coordinates in each. Cosine similarity between vectors from different models is meaningless.
- Q: When would you pick MMR over straight similarity? A: When top-k similarity returns duplicates or near-duplicates. MMR (Maximal Marginal Relevance) trades a bit of relevance for diversity, which helps when the corpus has many similar chunks.
- Q: FAISS vs Chroma for a side project? A: Chroma. It's simple, persists to disk once you set `persist_directory`, and has decent metadata filtering. FAISS is faster for raw similarity but is in-memory unless you handle save/load manually.
- Q: How do you scope retrieval to a specific tenant in multi-tenant apps? A: Either put tenants in different collections/namespaces (Pinecone namespace, Chroma collection) OR add a `tenant_id` to metadata and always filter on it.
- Q: Cost of switching embedding models from `-small` to `-large`? A: ~2× storage (3072 vs 1536 dims), a several-times-higher per-token embedding price (check current pricing), slightly slower search. Quality gain is moderate. Most apps don't need `-large`.
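The MMR answer above can be illustrated with a small pure-Python sketch. It assumes the common greedy formulation, where each step picks the candidate maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`; the function names and toy vectors are invented for illustration and this is not LangChain's own implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: balance relevance to the query against similarity to picks so far."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the closest match among already-selected items (0 if none).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0, 0.0]
candidates = [
    [1.0, 0.3, 0.0],   # most relevant
    [1.0, 0.25, 0.0],  # near-duplicate of the first
    [0.2, 1.0, 0.0],   # slightly less relevant, but different
]
by_similarity = sorted(range(3), key=lambda i: cosine(query, candidates[i]), reverse=True)[:2]
print(by_similarity)                   # [0, 1]
print(mmr(query, candidates, k=2))     # [0, 2]
```

Plain similarity returns the two near-duplicates, while MMR swaps the second one for the more distinct chunk. With `lambda_mult=1.0` the redundancy term vanishes and the result matches plain similarity ranking.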
Practice
What does this print?
Expected: True
Compute cosine similarity correctly (divide by both norms, not just one)
Expected: True
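The original exercise code is not shown here, so the following is only one possible approach to the cosine-similarity exercise, written in pure Python. The point it illustrates is the one the exercise names: divide the dot product by both norms, not just one.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """dot(a, b) / (|a| * |b|): dividing by BOTH norms cancels out magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length
print(math.isclose(cosine_similarity(a, b), 1.0))  # True
```

Dividing by only one norm would give 2.0 here instead of 1.0, because the length of `b` would leak into the score.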
Quiz: Quick check
What you remember
Q1. What does an embedding model do?
- Maps text to a fixed-length vector where semantically similar texts have similar vectors
- Generates text
- Counts tokens
- Translates languages
Why: Embeddings are the foundation of semantic search. "I love dogs" and "Canines are my favorite" should have similar vectors despite no shared words.
Q2. Which similarity metric is most commonly used with text embeddings?
- Euclidean distance
- Cosine similarity — measures angle, ignoring magnitude
- Manhattan distance
- Hamming distance
Why: Cosine similarity normalizes for vector length, focusing on direction. Two embeddings can have very different magnitudes but encode the same meaning — cosine catches that.
Q3. When should you choose text-embedding-3-large over text-embedding-3-small?
- Always
- Only when retrieval quality is the bottleneck and you can afford 2× the storage and a higher embedding cost
- For non-English
- Never
Why: `-small` is the cost/quality sweet spot for most apps. Switch to `-large` only if you've measured `-small` falling short. The quality gain is moderate; the cost hit is significant.
Common doubts
Which vector store should I use?
For development/learning: FAISS (local, fast, free). For production at small scale: Chroma (simple, persistent). For production at scale: pgvector (if you already have Postgres), Pinecone or Weaviate (managed services). Don't optimize this choice early — they're easy to swap later.
How big is an embedding?
Most OpenAI/sentence-transformers embeddings are 384-1536 dimensions of float32, so ~1.5-6 KB per chunk. For 1M chunks at 1536 dims, that's ~6 GB of raw vectors, before index overhead and metadata. Plan storage accordingly.
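The storage estimate is plain arithmetic: vectors × dimensions × 4 bytes per float32. A small sketch (the helper name `raw_vector_bytes` is made up) makes it easy to rerun for other sizes:

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 = 4 bytes per value).

    Index structures and metadata add overhead on top of this figure.
    """
    return num_vectors * dims * bytes_per_value


per_chunk = raw_vector_bytes(1, 1536)             # 6144 bytes, about 6 KB
one_million = raw_vector_bytes(1_000_000, 1536)   # 6,144,000,000 bytes, about 6.1 GB
print(per_chunk, one_million / 1e9)  # 6144 6.144
```

Doubling the dimensions to 3072 doubles both figures, which is where the "~2× storage" cost of the larger model comes from.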
Can I embed images or audio?
Yes — CLIP for images (joint text+image space), Whisper-style models for audio. For unified search across modalities, use multimodal embeddings. LangChain has wrappers for many — search "MultiVectorRetriever".