Reputation: 557
I have a Postgres query in the context of a Retrieval Augmented Generation (RAG) application, wrapped in a database function, that shows poor, unpredictable, and varying performance. I want to find out why the performance is inconsistent and, if possible, optimise the query. The code is open source and hosted on GitHub.
The database function is defined as follows:
CREATE OR REPLACE FUNCTION public.match_document_chunks (embedding vector, match_threshold double precision, match_count integer, num_probes integer)
RETURNS TABLE (id integer, processed_document_id integer, content text, similarity double precision)
LANGUAGE plpgsql
AS $function$
#variable_conflict use_variable
BEGIN
  EXECUTE format('SET LOCAL ivfflat.probes = %s', num_probes);
  RETURN QUERY
  SELECT
    processed_document_chunks.id,
    processed_document_chunks.processed_document_id,
    processed_document_chunks.content,
    (processed_document_chunks.embedding <#> embedding) * -1 AS similarity
  FROM
    processed_document_chunks
  WHERE (processed_document_chunks.embedding <#> embedding) * -1 > match_threshold
  ORDER BY
    processed_document_chunks.embedding <#> embedding
  LIMIT match_count;
END;
$function$
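For reference, a hypothetical invocation of the function (the 1536-dimension embedding literal is truncated) might look like:

```sql
-- Hypothetical call; the vector literal is truncated for brevity.
SELECT *
FROM match_document_chunks(
  '[0.037683565, -0.010703518, 0.00831142, ...]'::vector(1536),  -- embedding
  0.85,  -- match_threshold
  128,   -- match_count
  12     -- num_probes
);
```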
The table structure of "processed_document_chunks" is:
CREATE TABLE public.processed_document_chunks (
  id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  content text NOT NULL,
  embedding vector(1536) NOT NULL,
  page integer NOT NULL,
  chunk_index integer NOT NULL,
  processed_document_id integer REFERENCES public.processed_documents(id) ON DELETE CASCADE
);
The table contains ~150,000 rows. The embedding column is a 1536-dimension vector using pgvector. I have created indices on the embedding column, following the pgvector docs:
DO $$
DECLARE
  index_name TEXT;
  numRows INT;
BEGIN
  -- Delete old embedding indices first
  FOR index_name IN
    SELECT indexname FROM pg_indexes WHERE indexname LIKE '%processed_document_chunks_embedding_idx%'
  LOOP
    EXECUTE 'DROP INDEX IF EXISTS ' || index_name;
  END LOOP;
  -- Generate new embedding indices
  SELECT ROUND(COUNT(*) / 1000) INTO numRows FROM processed_document_chunks;
  EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists = ' || numRows || ')';
  EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_ip_ops) WITH (lists = ' || numRows || ')';
  EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = ' || numRows || ')';
END $$;
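Note that ivfflat indexes are per operator class: a query ordering by <#> (negative inner product) can only use the vector_ip_ops index, so the L2 and cosine indexes above are never used by this function and only add write and maintenance overhead. A sketch of the single index the query needs, with the lists value assumed from ~150,000 rows / 1000:

```sql
-- <#> pairs with vector_ip_ops; lists ≈ rows / 1000 for tables
-- under ~1M rows, per the pgvector docs.
CREATE INDEX ON processed_document_chunks
  USING ivfflat (embedding vector_ip_ops)
  WITH (lists = 150);
```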
PostgreSQL 15.1 (Ubuntu 15.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit
Hosted on Supabase, Central EU (Frankfurt), AWS, t4g.small
In the application, a user query is embedded into a 1536-dimension vector and sent with the following parameters to the db function match_document_chunks:

embedding = [0.037683565, -0.010703518, 0.00831142, ...]
match_threshold = 0.85
match_count = 128
num_probes = sqrt(num_rows / 1000) = 12 (according to the pgvector docs)
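The parameter heuristics above can be computed directly in SQL (a sketch; the numbers follow the pgvector guidance of lists ≈ rows / 1000 and probes ≈ sqrt(lists)):

```sql
-- With ~150,000 rows this yields lists ≈ 150 and probes ≈ 12.
SELECT
  count(*)                        AS num_rows,
  round(count(*) / 1000.0)        AS lists,
  round(sqrt(count(*) / 1000.0))  AS probes
FROM processed_document_chunks;
```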
If the

WHERE (processed_document_chunks.embedding <#> embedding) * -1 > match_threshold

statement is removed from the query, the response time decreases to an average of about 0.6 seconds. Since I am still doing ORDER BY and LIMIT, the query should be logically identical - or is there a difference?

- Am I missing something, i.e. does the WHERE clause change the logic of the query?
- Why is the query with the WHERE clause so unpredictable in performance?
- Why is the query without the WHERE clause much faster and more predictable?

EDIT #1:
For an exemplary query, this is the explain(analyze, verbose, buffers, settings)
output:
Limit (cost=12784.02..12812.40 rows=128 width=1209) (actual time=407.891..475.874 rows=128 loops=1)
  Output: id, processed_document_id, content, (((embedding <#> embedding) * '-1'::double precision)), ((embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector))
  Buffers: shared hit=2557 read=17213
  ->  Index Scan using processed_document_chunks_embedding_idx1 on public.processed_document_chunks (cost=12784.02..24015.56 rows=50654 width=1209) (actual time=407.889..475.816 rows=128 loops=1)
        Output: id, processed_document_id, content, ((embedding <#> embedding) * '-1'::double precision), (embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector)
        Order By: (processed_document_chunks.embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector)
        Filter: (((processed_document_chunks.embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector) * '-1'::double precision) > '0.85'::double precision)
        Buffers: shared hit=2557 read=17213
Settings: effective_cache_size = '1536MB', effective_io_concurrency = '200', jit = 'off', max_parallel_workers = '2', max_parallel_workers_per_gather = '1', random_page_cost = '1.1', search_path = '"$user", public, extensions', work_mem = '5MB'
Query Identifier: 4838635160876957833
Planning:
  Buffers: shared hit=19
Planning Time: 0.202 ms
Execution Time: 475.984 ms
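One thing the plan shows is that the Filter is applied to rows as the ordered index scan produces them: the scan keeps reading tuples until it has found match_count rows that pass the threshold, so when few rows qualify it reads far more data. Because similarity is just the negated distance the scan is ordered by, the filter is monotone in the sort key, and an equivalent query (a sketch, with the vector literal truncated) can apply the threshold after the LIMIT, letting the scan stop at match_count rows unconditionally:

```sql
-- Sketch: fetch the match_count nearest rows first, then filter.
-- Equivalent to the original because the threshold is monotone in
-- the ORDER BY key (similarity = -distance).
SELECT *
FROM (
  SELECT
    id,
    processed_document_id,
    content,
    (embedding <#> '[0.005677505, ...]'::vector) * -1 AS similarity
  FROM processed_document_chunks
  ORDER BY embedding <#> '[0.005677505, ...]'::vector
  LIMIT 128
) nearest
WHERE similarity > 0.85;
```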
EDIT #2 / Solution:
It seems there was nothing wrong with the query itself. The inconsistent performance was caused by the cloud compute instance. I have benchmarked the search on various compute instances offered by Supabase; it seems the solution is to Throw Money at The Problem. The tested compute instances can be looked up here: https://supabase.com/docs/guides/platform/compute-add-ons
Upvotes: 0
Views: 306