Pgvector
pgvector is an open-source extension for PostgreSQL that allows you to store, query, and index vector embeddings directly within your database.
How it Works:
- Storage: It introduces a new data type, vector. You can store arrays of numbers (embeddings generated by AI models like OpenAI, Hugging Face, or Llama) in standard table columns.
- Distance Metrics: It provides built-in operators to calculate similarity between vectors using:
  - Cosine distance (<=>): Best for most AI embeddings.
  - L2 distance (<->): Euclidean distance.
  - Inner product (<#>): Returns the negative inner product.
- Indexing: To keep searches fast as data grows, it uses approximate nearest neighbor (ANN) algorithms:
  - HNSW (Hierarchical Navigable Small World): High performance and great recall.
  - IVFFlat: Clusters vectors into lists and scans only the nearest lists at query time.
- Integration: Since it’s just a Postgres extension, you can combine vector searches with standard SQL features like WHERE clauses, joins, and ACID transactions in a single query (see the sketch after this list).
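As a minimal sketch of the last two points, here is an illustrative (not part of the walkthrough below) documents table with a hypothetical category column; dimensions are kept at 3 so the vector literal stays readable, and the HNSW parameters shown are pgvector's defaults:

-- Hypothetical table for illustration only
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    category text,
    embedding vector(3)
);

-- HNSW index on cosine distance (m and ef_construction shown at their defaults)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- Vector similarity search combined with an ordinary WHERE filter in one SQL query
SELECT id, category
FROM documents
WHERE category = 'fruit'
ORDER BY embedding <=> '[0.9, 0.1, 0.1]'
LIMIT 5;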
Setup
Pull down container and spin it up
docker pull pgvector/pgvector:pg18
docker run -d --name pgvector -e POSTGRES_PASSWORD=postgres -p 5432:5432 pgvector/pgvector:pg18
Create the extension
docker exec pgvector psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"
CREATE EXTENSION
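Optionally, confirm the extension is registered by querying the Postgres catalog (the version string you see depends on the image tag you pulled):

docker exec pgvector psql -U postgres -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"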
Playing Around For Understanding
Create Table
docker exec pgvector psql -U postgres -c "
CREATE TABLE IF NOT EXISTS items (
id serial PRIMARY KEY,
content text,
embedding vector(3)
);"
CREATE TABLE
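If you want to double-check the schema, psql's \d meta-command shows the embedding column declared as vector(3):

docker exec pgvector psql -U postgres -c "\d items"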
Insert Some Sample Vectors
docker exec pgvector psql -U postgres -c "
INSERT INTO items (content, embedding) VALUES
('Apple', '[1, 0, 0]'),
('Banana', '[1, 1, 0]'),
('SpaceShuttle', '[0, 0, 1]')
ON CONFLICT DO NOTHING;"
INSERT 0 3
Query for similarity
docker exec pgvector psql -U postgres -t -A -c "
SELECT content, 1 - (embedding <=> '[0.9, 0.1, 0.1]') AS similarity
FROM items
ORDER BY similarity DESC;"
| content | similarity |
| --- | --- |
| Apple | 0.9878783591404701 |
| Banana | 0.7761505613783749 |
| SpaceShuttle | 0.10976426667009398 |
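A side note on indexing: pgvector's HNSW and IVFFlat indexes are used when you ORDER BY the distance operator itself, whereas ordering by the computed similarity expression above generally cannot use a vector index. The index-friendly form of the same search (same ranking, just ascending cosine distance instead of a descending similarity score) would look roughly like this:

docker exec pgvector psql -U postgres -t -A -c "
SELECT content, embedding <=> '[0.9, 0.1, 0.1]' AS cosine_distance
FROM items
ORDER BY embedding <=> '[0.9, 0.1, 0.1]'
LIMIT 3;"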
Understanding the result
The results demonstrate Semantic Similarity versus Keyword Matching. Here is the breakdown:
1. The Numbers (The Math)
The result of 1 - (embedding <=> '[query]') converts “Cosine Distance” into a Similarity Score (the boundary cases are verified in the query after this list):
- 1.0: Perfect match (pointing in the exact same direction).
- 0.0: Orthogonal (completely unrelated/perpendicular).
- Negative: Pointing in opposite directions.
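You can check those three cases directly with vector literals, no table required (a quick sanity check, not part of the original walkthrough), and it should print 1, 0, and -1:

docker exec pgvector psql -U postgres -t -A -c "
SELECT 1 - ('[1, 0, 0]'::vector <=> '[1, 0, 0]'::vector)  AS same_direction,
       1 - ('[1, 0, 0]'::vector <=> '[0, 1, 0]'::vector)  AS orthogonal,
       1 - ('[1, 0, 0]'::vector <=> '[-1, 0, 0]'::vector) AS opposite;"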
2. Physical Interpretation
Imagine your data points in a 3D room (the angle query after this list makes this concrete):
- Your query was [0.9, 0.1, 0.1]. This is almost entirely on the X-axis.
- Apple ([1, 0, 0]) is also sitting strictly on the X-axis. Because they “point” in nearly the same direction, the score is very high (0.98).
- Banana ([1, 1, 0]) is at a 45-degree angle between X and Y. It’s “close-ish” but pulling away from the X-axis, resulting in a lower score (0.77).
- Space Shuttle ([0, 0, 1]) is on the Z-axis. It is perpendicular to your query. The score is very low (0.10).
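Since cosine similarity is just the cosine of the angle between two vectors, you can ask Postgres for the angles themselves using the standard acos and degrees functions (an illustrative query; expect roughly 9° for Apple, 39° for Banana, and 84° for the space shuttle):

docker exec pgvector psql -U postgres -t -A -c "
SELECT content, round(degrees(acos(1 - (embedding <=> '[0.9, 0.1, 0.1]')))::numeric, 1) AS angle_degrees
FROM items
ORDER BY angle_degrees;"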
3. The Significance for AI
In a real-world RAG (Retrieval-Augmented Generation) system:
- The Dimensions represent “Meaning”: Instead of simple X, Y, Z axes, an LLM embedding might have 1,536 dimensions. One dimension might represent “fruitiness,” another “technology,” another “altitude.”
- Fuzzy Searching: You didn’t search for the word “Apple.” You searched for a vector coordinate. Even if your content was “Red Crunchy Fruit” and your query was “Granny Smith,” the vectors would be mathematically close because the model knows they occupy the same “semantic space.”
- Ranking: Notice that pgvector sorted the results perfectly. In an app, you would take the top 3 results and feed that content to an LLM as context (see the example below).
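For instance, the retrieval step of a minimal RAG flow against this toy table would be the query below; in a real app the query vector would come from the same embedding model that produced the stored embeddings, and the returned content rows would be pasted into the LLM prompt as context:

docker exec pgvector psql -U postgres -t -A -c "
SELECT content
FROM items
ORDER BY embedding <=> '[0.9, 0.1, 0.1]'
LIMIT 3;"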
Summary
You just performed a Nearest Neighbor Search. You proved that you can find relevant data based on its “closeness” in a multi-dimensional space rather than looking for exact string matches.