PostgreSQL for the AI Era: Using pgvector and Semantic Search
You don't need a separate vector database. PostgreSQL can do semantic search alongside your existing relational data — and the hybrid query capability is where it beats dedicated vector databases for most applications.
What pgvector actually is
pgvector is a PostgreSQL extension that adds a vector data type and distance operators. Install it, create a column of type vector(n) where n is the dimension count, and query by similarity using standard SQL.
Requirements: PostgreSQL 12 or later. pgvector 0.5.0+ recommended (HNSW indexes were added in 0.5.0).
Setting it up
CREATE EXTENSION IF NOT EXISTS vector;
-- 1536 dims for OpenAI text-embedding-3-small / ada-002
-- 768 dims for most open-source models (e5-base, bge-base, etc.)
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    category TEXT NOT NULL,
    source_url TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    embedding vector(1536)
);
In practice, generate the embedding in your application and pass it as a parameter:
import openai
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # pip install pgvector

def store_document(conn, content, category, source_url):
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=content,
    )
    embedding = response.data[0].embedding  # list of 1536 floats

    # psycopg2 won't adapt a plain list to the vector type;
    # register_vector (once per connection) handles numpy arrays.
    register_vector(conn)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (content, category, source_url, embedding)"
            " VALUES (%s, %s, %s, %s)",
            (content, category, source_url, np.array(embedding)),
        )
    conn.commit()
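If you'd rather not pull in the pgvector Python package, the extension also accepts its plain text input format (e.g. '[0.1,-0.2,0.3]'), so you can serialise the list yourself and cast in SQL. `to_pgvector` below is a hypothetical helper, not part of any library:

```python
def to_pgvector(values):
    """Serialise floats into pgvector's text input format, e.g. '[0.1,-0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

# Used with an explicit cast so PostgreSQL parses the string as a vector:
#   cur.execute(
#       "INSERT INTO documents (content, category, source_url, embedding)"
#       " VALUES (%s, %s, %s, %s::vector)",
#       (content, category, source_url, to_pgvector(embedding)),
#   )
```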
The three distance operators
<-> — Euclidean distance (L2)
Measures straight-line distance in vector space. Use for spatial data or image similarity where magnitude matters.
<=> — Cosine distance
Measures the angle between vectors, ignoring magnitude. Most text embedding models are designed for cosine similarity — use this for semantic search.
SELECT content, embedding <=> '[0.1, -0.2, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 5;
<#> — Negative inner product
For normalised vectors (unit vectors), inner product equals cosine similarity. Check your model docs to know whether embeddings are normalised.
For most text search use cases, start with <=> (cosine).
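To make the operators concrete, here's a pure-Python sketch of what each one computes (pgvector implements these in C; the formulas below mirror the documented semantics):

```python
import math

def l2_distance(a, b):
    # <-> : straight-line (Euclidean) distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # <=> : 1 - cosine similarity (angle only, magnitude ignored)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def neg_inner_product(a, b):
    # <#> : negated dot product (negated so that smaller = more similar)
    return -sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

assert cosine_distance(a, b) < 1e-9   # cosine sees them as identical
assert l2_distance(a, b) == 5.0       # Euclidean does not

# For unit vectors, inner product equals cosine similarity:
u = [x / 5.0 for x in a]   # a normalised (|a| = 5)
v = [x / 10.0 for x in b]  # b normalised (|b| = 10)
assert abs(-neg_inner_product(u, v) - (1 - cosine_distance(u, v))) < 1e-9
```

This is why the choice of operator matters: two documents with embeddings pointing the same way are a perfect cosine match even when their magnitudes differ.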
Making it fast — IVFFlat and HNSW indexes
Without an index, a nearest-neighbour query scans every row in the table. This is exact but slow at scale. pgvector provides two approximate nearest-neighbour index types:
IVFFlat
Partitions the vector space into clusters and searches only the nearest clusters at query time.
-- Build after inserting data (IVFFlat needs data to learn the clusters)
-- Rule of thumb: lists = sqrt(row_count) for < 1M rows
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
SET ivfflat.probes = 10; -- higher = more accurate, slower
HNSW
Graph-based index with better recall than IVFFlat. Slower to build, uses more memory.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 64; -- higher = more accurate, slower
Trade-off: IVFFlat builds faster and uses less memory, which is good enough for most use cases; HNSW gives better recall and is worth the extra build time and memory for precision-critical applications.
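The rules of thumb above can be wrapped in a small helper. This is a sketch of the heuristics stated here (lists = sqrt(row_count), the HNSW defaults of m = 16 and ef_construction = 64), not official pgvector guidance, and `index_ddl` is a hypothetical name:

```python
import math

def index_ddl(table, column="embedding", row_count=0, precision_critical=False):
    """Sketch: choose an ANN index from the rules of thumb above."""
    if precision_critical:
        # HNSW: better recall, slower build, more memory
        return (f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
                f"WITH (m = 16, ef_construction = 64);")
    # IVFFlat: lists = sqrt(row_count) for < 1M rows
    lists = max(1, round(math.sqrt(row_count)))
    return (f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
            f"WITH (lists = {lists});")

print(index_ddl("documents", row_count=10_000))
# CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```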
The real power — hybrid queries
This is what PostgreSQL does better than dedicated vector databases like Pinecone. When metadata filtering happens after the ANN search (post-filtering), you search for 100 nearest neighbours and then filter down, which means you may get fewer results than you asked for if many neighbours don't match your filter.
In PostgreSQL, you write a WHERE clause and the query planner handles it:
-- Semantic search within a specific category, last 30 days
SELECT
    d.id,
    d.content,
    d.source_url,
    d.embedding <=> $1 AS distance
FROM documents d
WHERE d.category = 'postgresql'
  AND d.created_at > NOW() - INTERVAL '30 days'
ORDER BY d.embedding <=> $1
LIMIT 10;
The WHERE clause runs on indexed relational columns, and vector similarity is computed on rows that actually match the filter, so you get genuinely filtered results rather than post-hoc filtering of a pre-computed list. (One caveat: when the planner uses the ANN index, a highly selective filter can still return fewer rows than the LIMIT; pgvector 0.8.0 added iterative index scans to mitigate this.) You can also combine vector search with JOINs, CTEs, and aggregates.
A practical example: find similar products
CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT NOT NULL,
    price NUMERIC(10, 2) NOT NULL,
    description TEXT NOT NULL,
    embedding vector(1536)
);
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON products (category);
CREATE INDEX ON products (price);
Query: find products similar to product #42, in the same category, under €50:
WITH target AS (
    SELECT embedding, category FROM products WHERE id = 42
)
SELECT
    p.id, p.name, p.price,
    p.embedding <=> t.embedding AS similarity_distance
FROM products p
CROSS JOIN target t
WHERE p.category = t.category
  AND p.price < 50.00
  AND p.id != 42
ORDER BY similarity_distance
LIMIT 10;
This query would require multiple API calls and client-side merging in a dedicated vector database. In PostgreSQL, it's one query.
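For contrast, here's roughly what that client-side merging looks like, sketched in Python over in-memory rows (`similar_products` is a hypothetical function; with a vector-only store, the candidate fetch and the filtering would be separate round trips):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def similar_products(products, target_id, max_price, k=10):
    """Client-side stand-in for the CTE query: filter first, then rank by distance."""
    target = next(p for p in products if p["id"] == target_id)
    candidates = [
        p for p in products
        if p["category"] == target["category"]
        and p["price"] < max_price
        and p["id"] != target_id
    ]
    candidates.sort(key=lambda p: cosine_distance(p["embedding"], target["embedding"]))
    return candidates[:k]
```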
Limitations worth knowing
For datasets over a few million vectors with very high query throughput requirements, dedicated vector databases (Qdrant, Weaviate, Milvus) offer better performance — they're built exclusively for this use case. pgvector is also single-machine; if you need horizontal scaling of the vector index specifically, you're in custom territory.
The sweet spot for pgvector is what most applications actually need: tens of thousands to a few million vectors, hybrid queries combining vector similarity with relational data, and a single database to maintain. I've been running pgvector in production on a few projects. For document search over ~200,000 documents with category and date filters, query time with HNSW is 8–15ms on a mid-range server.
pgvector turns PostgreSQL into a database that handles vector search without giving up anything else PostgreSQL is good at. For most applications, that's enough: one database to run, one to maintain, no sync process between your vectors and your relational data.