Pgvector

pgvector is an open-source extension for PostgreSQL that allows you to store, query, and index vector embeddings directly within your database.

How it Works:

  1. Storage: It introduces a new data type, vector. You can store arrays of numbers (embeddings generated by AI models like OpenAI, Hugging Face, or Llama) in standard table columns.
  2. Distance Metrics: It provides built-in operators to calculate similarity between vectors using:
    • Cosine distance (<=>): Best for most AI embeddings.
    • L2 distance (<->): Euclidean distance.
    • Inner product (<#>): Returns the negative inner product.
  3. Indexing: To keep searches fast as data grows, it uses approximate nearest neighbor (ANN) algorithms:
    • HNSW (Hierarchical Navigable Small World): High performance and great recall.
    • IVFFlat: Partitions vectors into lists and scans only the closest ones.
  4. Integration: Since it’s just a Postgres extension, you can combine vector searches with standard SQL features like WHERE clauses, joins, and ACID transactions in a single query (see the sketch after this list).
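
Here is a minimal sketch of how these pieces fit together. The table and column names (docs, category) and the filter value are hypothetical, and the 3-dimensional vectors match the toy examples later in this note:

-- A table with 3-dimensional embeddings
CREATE TABLE docs (id bigserial PRIMARY KEY, category text, embedding vector(3));

-- An HNSW index using cosine distance
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

-- Vector similarity combined with an ordinary WHERE filter in one query
SELECT id, category
FROM docs
WHERE category = 'fruit'
ORDER BY embedding <=> '[0.9, 0.1, 0.1]'
LIMIT 5;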

Setup

Pull down the container and spin it up

docker pull pgvector/pgvector:pg18
docker run -d --name pgvector -e POSTGRES_PASSWORD=postgres -p 5432:5432 pgvector/pgvector:pg18
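
Optionally, confirm the server is accepting connections before the next step; the image ships with the standard pg_isready client:

docker exec pgvector pg_isready -U postgres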

Create the extension

docker exec pgvector psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"

CREATE EXTENSION
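
To double-check which version got installed, you can query the pg_extension catalog:

docker exec pgvector psql -U postgres -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"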

Playing Around For Understanding

Create Table

docker exec pgvector psql -U postgres -c "
CREATE TABLE IF NOT EXISTS items (
  id serial PRIMARY KEY,
  content text,
  embedding vector(3)
);"

CREATE TABLE

Insert Some Sample Vectors

docker exec pgvector psql -U postgres -c "
INSERT INTO items (content, embedding) VALUES 
('Apple', '[1, 0, 0]'),
('Banana', '[1, 1, 0]'),
('SpaceShuttle', '[0, 0, 1]')
ON CONFLICT DO NOTHING;"

INSERT 0 3

Note: since content has no unique constraint, the ON CONFLICT DO NOTHING clause never fires here; re-running the insert will simply add duplicate rows.

Query for similarity

docker exec pgvector psql -U postgres -t -A -c "
SELECT content, 1 - (embedding <=> '[0.9, 0.1, 0.1]') AS similarity
FROM items
ORDER BY similarity DESC;"
Apple|0.9878783591404701
Banana|0.7761505613783749
SpaceShuttle|0.10976426667009398
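
Computing a similarity expression and sorting DESC is fine for a toy table, but that form will generally not be served by a vector index. The idiomatic, index-friendly version orders by the distance operator directly (the LIMIT value of 2 is arbitrary):

docker exec pgvector psql -U postgres -t -A -c "
SELECT content
FROM items
ORDER BY embedding <=> '[0.9, 0.1, 0.1]'
LIMIT 2;"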

Understanding the result

The results demonstrate Semantic Similarity versus Keyword Matching. Here is the breakdown:

1. The Numbers (The Math)

The result of 1 - (embedding <=> '[query]') converts “Cosine Distance” into a Similarity Score:

  • 1.0: Perfect match (pointing in the exact same direction).
  • 0.0: Orthogonal (completely unrelated/perpendicular).
  • Negative: Pointing in opposite directions.
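
As a concrete check, Apple's score can be reproduced by hand from the cosine-similarity formula (a worked example, not pgvector output). With query q = [0.9, 0.1, 0.1] and Apple a = [1, 0, 0]:

\[
\text{similarity} = \frac{a \cdot q}{\lVert a \rVert \, \lVert q \rVert}
= \frac{(1)(0.9) + (0)(0.1) + (0)(0.1)}{1 \cdot \sqrt{0.9^2 + 0.1^2 + 0.1^2}}
= \frac{0.9}{\sqrt{0.83}} \approx 0.9879
\]

which matches the 0.9878783591404701 the query returned for Apple.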

2. Physical Interpretation

Imagine your data points in a 3D room:

  • Your query was [0.9, 0.1, 0.1]. This is almost entirely on the X-axis.
  • Apple ([1, 0, 0]) is also sitting strictly on the X-axis. Because they “point” in nearly the same direction, the score is very high (0.98).
  • Banana ([1, 1, 0]) is at a 45-degree angle between X and Y. It’s “close-ish” but pulling away from the X-axis, resulting in a lower score (0.77).
  • SpaceShuttle ([0, 0, 1]) is on the Z-axis. It is perpendicular to your query. The score is very low (0.10).

3. The Significance for AI

In a real-world RAG (Retrieval-Augmented Generation) system:

  • The Dimensions represent “Meaning”: Instead of simple X, Y, Z axes, a text-embedding model might produce 1,536 dimensions. Loosely speaking, one dimension might capture “fruitiness,” another “technology,” another “altitude.”
  • Fuzzy Searching: You didn’t search for the word “Apple.” You searched for a vector coordinate. Even if your content was “Red Crunchy Fruit” and your query was “Granny Smith,” the vectors would be mathematically close because the model knows they occupy the same “semantic space.”
  • Ranking: Notice that pgvector sorted the results perfectly. In an app, you would take the top 3 results and feed that content to an LLM as context (see the sketch after this list).
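
A hypothetical retrieval query for such a system might look like the following, where documents, its 1536-dimensional embedding column, and the query vector placeholder are all stand-ins:

-- Fetch the 3 most semantically similar chunks to feed the LLM as context
SELECT content
FROM documents
ORDER BY embedding <=> '[...]'  -- the embedded user question goes here
LIMIT 3;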

Summary

You just performed a Nearest Neighbor Search. You proved that you can find relevant data based on its “closeness” in a multi-dimensional space rather than looking for exact string matches.

Created: 2025-12-20 Sat 17:00