Reputation: 557
I have a Postgres query in the context of a Retrieval Augmented Generation (RAG) application, wrapped in a database function, that shows poor, unpredictable, and varying performance. I want to find out why the performance is inconsistent and, if possible, optimise the query. The code is open source and hosted on GitHub.
The database function is defined as follows:
CREATE OR REPLACE FUNCTION public.match_document_chunks (embedding vector, match_threshold double precision, match_count integer, num_probes integer)
RETURNS TABLE (id integer, processed_document_id integer, content text, similarity double precision)
LANGUAGE plpgsql
AS $function$
#variable_conflict use_variable
BEGIN
  EXECUTE format('SET LOCAL ivfflat.probes = %s', num_probes);
  RETURN QUERY
  SELECT
    processed_document_chunks.id,
    processed_document_chunks.processed_document_id,
    processed_document_chunks.content,
    (processed_document_chunks.embedding <#> embedding) * -1 AS similarity
  FROM
    processed_document_chunks
  WHERE (processed_document_chunks.embedding <#> embedding) * -1 > match_threshold
  ORDER BY
    processed_document_chunks.embedding <#> embedding
  LIMIT match_count;
END;
$function$
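For reference, a hypothetical invocation of the function (the 1536-dimension embedding literal is truncated) might look like:

```sql
-- Hypothetical call; the vector literal is truncated for brevity.
SELECT *
FROM match_document_chunks(
  '[0.037683565, -0.010703518, 0.00831142, ...]'::vector(1536),  -- embedding
  0.85,  -- match_threshold
  128,   -- match_count
  12     -- num_probes
);
```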
The table structure of "processed_document_chunks" is:
CREATE TABLE public.processed_document_chunks (
  id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  content text NOT NULL,
  embedding vector(1536) NOT NULL,
  page integer NOT NULL,
  chunk_index integer NOT NULL,
  processed_document_id integer REFERENCES public.processed_documents(id) ON DELETE CASCADE
);
The table contains ~150,000 rows. The embedding column is a 1536-dimension vector using pgvector. I have created indices on the embedding column, following the pgvector docs:
DO $$
DECLARE
  index_name TEXT;
  numRows INT;
BEGIN
  -- Delete old embedding indices first
  FOR index_name IN
    SELECT indexname FROM pg_indexes WHERE indexname LIKE '%processed_document_chunks_embedding_idx%'
  LOOP
    EXECUTE 'DROP INDEX IF EXISTS ' || index_name;
  END LOOP;
  -- Generate new embedding indices
  SELECT ROUND(COUNT(*) / 1000) INTO numRows FROM processed_document_chunks;
  EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_l2_ops) WITH (lists = ' || numRows || ')';
  EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_ip_ops) WITH (lists = ' || numRows || ')';
  EXECUTE 'CREATE INDEX ON processed_document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = ' || numRows || ')';
END $$;
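Note that ivfflat indexes are per operator class: a query ordering by <#> (negative inner product) can only use the vector_ip_ops index, so the L2 and cosine indexes above are never used by this function and only add write and maintenance overhead. A sketch of the single index the query needs, with the lists value assumed from ~150,000 rows / 1000:

```sql
-- <#> pairs with vector_ip_ops; lists ≈ rows / 1000 for tables
-- under ~1M rows, per the pgvector docs.
CREATE INDEX ON processed_document_chunks
  USING ivfflat (embedding vector_ip_ops)
  WITH (lists = 150);
```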
PostgreSQL 15.1 (Ubuntu 15.1-1.pgdg20.04+1) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit
Hosted on Supabase, Central EU (Frankfurt), AWS, t4g.small
In the application, a user query is embedded into a 1536-dimension vector and sent with the following parameters to the db function match_document_chunks:

embedding = [0.037683565, -0.010703518, 0.00831142, ...]
match_threshold = 0.85
match_count = 128
num_probes = sqrt(num_rows / 1000) = 12 (according to the pgvector docs)
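The parameter heuristics above can be computed directly in SQL (a sketch; the numbers follow the pgvector guidance of lists ≈ rows / 1000 and probes ≈ sqrt(lists)):

```sql
-- With ~150,000 rows this yields lists ≈ 150 and probes ≈ 12.
SELECT
  count(*)                        AS num_rows,
  round(count(*) / 1000.0)        AS lists,
  round(sqrt(count(*) / 1000.0))  AS probes
FROM processed_document_chunks;
```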
If the

WHERE (processed_document_chunks.embedding <#> embedding) * -1 > match_threshold

statement is removed from the query, the response time decreases to an average of about 0.6 seconds. Since I am still doing ORDER BY and LIMIT, the query should be logically identical - or is there a difference?

- Am I missing something, i.e. does the WHERE clause change the logic of the query?
- Why is the query with the WHERE clause so unpredictable in performance?
- Why is the query without the WHERE clause much faster and more predictable?

EDIT #1:
For an exemplary query, this is the explain(analyze, verbose, buffers, settings)
output:
Limit (cost=12784.02..12812.40 rows=128 width=1209) (actual time=407.891..475.874 rows=128 loops=1)
  Output: id, processed_document_id, content, (((embedding <#> embedding) * '-1'::double precision)), ((embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector))
  Buffers: shared hit=2557 read=17213
  ->  Index Scan using processed_document_chunks_embedding_idx1 on public.processed_document_chunks (cost=12784.02..24015.56 rows=50654 width=1209) (actual time=407.889..475.816 rows=128 loops=1)
        Output: id, processed_document_id, content, ((embedding <#> embedding) * '-1'::double precision), (embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector)
        Order By: (processed_document_chunks.embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector)
        Filter: (((processed_document_chunks.embedding <#> '[0.005677505,-0.016324464,-0.009545551,…]'::vector) * '-1'::double precision) > '0.85'::double precision)
        Buffers: shared hit=2557 read=17213
Settings: effective_cache_size = '1536MB', effective_io_concurrency = '200', jit = 'off', max_parallel_workers = '2', max_parallel_workers_per_gather = '1', random_page_cost = '1.1', search_path = '"$user", public, extensions', work_mem = '5MB'
Query Identifier: 4838635160876957833
Planning:
  Buffers: shared hit=19
Planning Time: 0.202 ms
Execution Time: 475.984 ms
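One thing the plan shows is that the Filter is applied to rows as the ordered index scan produces them: the scan keeps reading tuples until it has found match_count rows that pass the threshold, so when few rows qualify it reads far more data. Because similarity is just the negated distance the scan is ordered by, the filter is monotone in the sort key, and an equivalent query (a sketch, with the vector literal truncated) can apply the threshold after the LIMIT, letting the scan stop at match_count rows unconditionally:

```sql
-- Sketch: fetch the match_count nearest rows first, then filter.
-- Equivalent to the original because the threshold is monotone in
-- the ORDER BY key (similarity = -distance).
SELECT *
FROM (
  SELECT
    id,
    processed_document_id,
    content,
    (embedding <#> '[0.005677505, ...]'::vector) * -1 AS similarity
  FROM processed_document_chunks
  ORDER BY embedding <#> '[0.005677505, ...]'::vector
  LIMIT 128
) nearest
WHERE similarity > 0.85;
```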
EDIT #2 / Solution:
It seems there was nothing wrong with the query itself. The inconsistent performance was caused by the cloud compute instance. I have benchmarked the search on various compute instances offered by Supabase; it seems the solution is to Throw Money at The Problem. The tested compute instances can be looked up here: https://supabase.com/docs/guides/platform/compute-add-ons
Upvotes: 0
Views: 306