1awuesterose

Reputation: 557

Unpredictable, bad performance of vector similarity search in Postgres database with `pgvector`

Problem

I have a Postgres query in the context of a Retrieval Augmented Generation (RAG) application that is wrapped in a database function and shows poor, unpredictable, and varying performance. I want to find out why the performance is not consistent and possibly optimise the query. The code is open-source and hosted on GitHub.

Code

The database function is defined as follows:

CREATE OR REPLACE FUNCTION public.match_document_chunks (embedding vector, match_threshold double precision, match_count integer, num_probes integer)
RETURNS TABLE (id integer, processed_document_id integer, content text, similarity double precision)
LANGUAGE plpgsql
AS $function$
#variable_conflict use_variable
BEGIN
    EXECUTE format('SET LOCAL ivfflat.probes = %s', num_probes);
    RETURN QUERY
    SELECT
        processed_document_chunks.id,
        processed_document_chunks.processed_document_id,
        processed_document_chunks.content,
        (processed_document_chunks.embedding <#> embedding) * -1 AS similarity
    FROM
        processed_document_chunks
    WHERE (processed_document_chunks.embedding <#> embedding) * -1 > match_threshold
    ORDER BY
        processed_document_chunks.embedding <#> embedding
    LIMIT match_count;
END;
$function$
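For context on the similarity math: pgvector's `<#>` operator returns the *negative* inner product (so smaller means more similar), which is why the function multiplies by -1 before comparing against `match_threshold`. A minimal Python sketch of that arithmetic, with plain lists standing in for the vector columns (the helper name is illustrative, not part of pgvector):

```python
def neg_inner_product(a, b):
    """Mimic pgvector's <#> operator: the NEGATIVE inner product."""
    return -sum(x * y for x, y in zip(a, b))

# Toy 3-dimensional stand-ins for the 1536-dimensional embeddings.
query = [1.0, 0.0, 0.0]
chunk = [0.9, 0.1, 0.0]

# The function's similarity expression: (embedding <#> query) * -1
similarity = neg_inner_product(chunk, query) * -1
print(similarity)  # 0.9 -> passes a match_threshold of 0.85
```

Note this only equals cosine similarity when the embeddings are normalised to unit length (which OpenAI-style 1536-dimensional embeddings are).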

The table structure of "processed_document_chunks" is:

CREATE TABLE public.processed_document_chunks(
    id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    content text not null,
    embedding vector(1536) not null,
    page integer not null,
    chunk_index integer not null,
    processed_document_id integer REFERENCES public.processed_documents(id) ON DELETE CASCADE
);

The table contains ~150,000 rows. The embedding column is a 1536-dimensional vector using pgvector.

I have created indices on the embedding column, following the pgvector docs:

DO $$
DECLARE 
  index_name TEXT;
  numRows INT;

BEGIN
-- Delete old embedding indices first
FOR index_name IN
    SELECT indexname FROM pg_indexes WHERE indexname LIKE '%processed_document_chunks_embedding_idx%'
LOOP
    EXECUTE 'DROP INDEX IF EXISTS ' || index_name;
END LOOP;

-- Generate new embedding indices.
-- Note: numRows actually holds the "lists" parameter (row count / 1000),
-- per the pgvector guidance for tables with up to ~1M rows.
SELECT ROUND(COUNT(*) / 1000) INTO numRows FROM processed_document_chunks;

EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists = ' || numRows || ')';
EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_ip_ops) WITH (lists = ' || numRows || ')';
EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = ' || numRows || ')';
END $$;
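For reference, this is the `lists` value the block above computes for the table: the pgvector guideline is `lists = rows / 1000` for tables up to roughly 1M rows. A quick sketch of that calculation (the function name is illustrative):

```python
def ivfflat_lists(row_count: int) -> int:
    # pgvector guidance: lists = rows / 1000 for up to ~1M rows
    return max(1, round(row_count / 1000))

print(ivfflat_lists(150_000))  # 150 lists for the ~150,000-row table
```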

Architecture

PostgreSQL 15.1 (Ubuntu 15.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit

Hosted on Supabase, Central EU (Frankfurt), AWS, t4g.small

Usage

In the application, a user query is embedded into a 1536-dimensional vector and sent with the following parameters to the db function match_document_chunks.

embedding = [0.037683565,-0.010703518,0.00831142, ...]

match_threshold = 0.85

match_count = 128

num_probes = sqrt(num_rows / 1000) = sqrt(lists) = 12 (according to the pgvector docs)
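A quick sanity check of that probes arithmetic, following the pgvector guideline `probes = sqrt(lists)`:

```python
import math

lists = 150  # from the index definition above (rows / 1000)
probes = round(math.sqrt(lists))  # sqrt(150) ≈ 12.25 -> 12
print(probes)  # 12
```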

My findings

[screenshot: measured search times, showing inconsistent performance for the same query]

My questions

Why is the search performance so inconsistent, and how can the query be optimised?

EDIT #1:

For an example query, this is the EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) output:

Limit  (cost=12784.02..12812.40 rows=128 width=1209) (actual time=407.891..475.874 rows=128 loops=1)
  Output: id, processed_document_id, content, (((embedding <#> embedding) * '-1'::double precision)), ((embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector))
  Buffers: shared hit=2557 read=17213
  ->  Index Scan using processed_document_chunks_embedding_idx1 on public.processed_document_chunks  (cost=12784.02..24015.56 rows=50654 width=1209) (actual time=407.889..475.816 rows=128 loops=1)
        Output: id, processed_document_id, content, ((embedding <#> embedding) * '-1'::double precision), (embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector)
        Order By: (processed_document_chunks.embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector)
        Filter: (((processed_document_chunks.embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector) * '-1'::double precision) > '0.85'::double precision)
        Buffers: shared hit=2557 read=17213
Settings: effective_cache_size = '1536MB', effective_io_concurrency = '200', jit = 'off', max_parallel_workers = '2', max_parallel_workers_per_gather = '1', random_page_cost = '1.1', search_path = '"$user", public, extensions', work_mem = '5MB'
Query Identifier: 4838635160876957833
Planning:
  Buffers: shared hit=19
Planning Time: 0.202 ms
Execution Time: 475.984 ms
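One detail worth noting in this plan: of the 19,770 buffers touched, only 2,557 were shared-cache hits, so roughly 87% of the pages had to come from disk (or OS cache). When the working set of 1536-dimensional vectors plus index does not fit in a small instance's RAM, run-to-run timings depend on what happens to be cached, which alone could explain the variance. A back-of-the-envelope from the Buffers line (assuming Postgres's default 8 kB page size):

```python
shared_hit, shared_read = 2557, 17213   # from the "Buffers:" line above
total = shared_hit + shared_read
hit_ratio = shared_hit / total
bytes_read = shared_read * 8192         # default 8 kB Postgres page size

print(f"cache hit ratio: {hit_ratio:.1%}")              # ~12.9%
print(f"read outside shared buffers: {bytes_read / 2**20:.0f} MiB")  # ~134 MiB
```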

EDIT #2 / Solution:

It turns out there was nothing wrong with the query. The inconsistent performance was caused by the cloud compute instance. I benchmarked the search on various compute instances offered by Supabase; the solution, it seems, is to throw money at the problem. The tested compute instances can be looked up here: https://supabase.com/docs/guides/platform/compute-add-ons

[chart: average db search time per Supabase compute instance]

Upvotes: 0

Views: 306

Answers (0)
