Will Calderwood
Will Calderwood

Reputation: 4636

Why is the pgvector index not being used?

I have the following query using PostgreSQL and pgvector

I've created an index on the vector column

CREATE INDEX ON news_items_embedding USING hnsw (embedding_open_ai_large halfvec_cosine_ops)

The query is

SELECT embedding_open_ai_large <=> '[-0.048431396,0.0135269165,...]' AS distance, news_items.id
FROM news_items inner join news_items_embedding on news_items_embedding.id = news_items.id
where news_items.publish_time >= '2024-02-12T23:01:00.000Z'
and news_items.publish_time < '2024-02-26T23:01:00.000Z'
and news_items.verified_state >= 1
and embedding_open_ai_large <=> '[-0.048431396,0.0135269165,...]' < 0.2
order by embedding_open_ai_large <=> '[-0.048431396,0.0135269165,...]'
limit 100

If I remove the line and news_items.publish_time < '2024-02-26T23:01:00.000Z' then the index on the vector column is used, but with that line in there it does a table scan on the vector column instead taking 30x longer.

Why is a table scan being used with that line in place? How can I force/encourage the use of the index by restructuring the query.

Below is the plan excluding and news_items.publish_time < '2024-02-26T23:01:00.000Z'

"Limit  (cost=320.89..698.43 rows=100 width=40) (actual time=1.031..1.259 rows=1 loops=1)"
"  Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"  Buffers: shared hit=838"
"  ->  Nested Loop  (cost=320.89..5074.20 rows=1259 width=40) (actual time=1.030..1.258 rows=1 loops=1)"
"        Output: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"        Inner Unique: true"
"        Buffers: shared hit=838"
"        ->  Index Scan using news_items_embedding_embedding_open_ai_large_idx on public.news_items_embedding  (cost=320.60..748.90 rows=3207 width=22) (actual time=1.020..1.245 rows=2 loops=1)"
"              Output: news_items_embedding.id, news_items_embedding.embedding_open_ai_large"
"              Order By: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)"
"              Filter: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec) < '0.19999999999999996'::double precision)"
"              Rows Removed by Filter: 38"
"              Buffers: shared hit=829"
"        ->  Index Scan using news_items_pkey on public.news_items  (cost=0.29..1.35 rows=1 width=32) (actual time=0.003..0.003 rows=0 loops=2)"
"              Output: news_items.id, news_items.url, news_items.s3_image_path, news_items.s3_image_format, news_items.s3_image_width, news_items.s3_image_height, news_items.publish_time, news_items.verified_time, news_items.ai_title, news_items.ai_summary_short, news_items.verified_state, news_items.tag_time, news_items.sentiment, news_items.last_click_time, news_items.newsletter_referrals, news_items.website_referrals, news_items.popularity, news_items.source_name, news_items.tags, news_items.region_tags, news_items.reject_tags, news_items.news_industries, news_items.business_categories"
"              Index Cond: (news_items.id = news_items_embedding.id)"
"              Filter: ((news_items.publish_time >= '2024-02-12 23:01:00+00'::timestamp with time zone) AND (news_items.verified_state >= 1))"
"              Rows Removed by Filter: 0"
"              Buffers: shared hit=6"
"Planning:"
"  Buffers: shared hit=15"
"Planning Time: 0.187 ms"
"Execution Time: 1.279 ms"

Below in the plan including and news_items.publish_time < '2024-02-26T23:01:00.000Z'

"Limit  (cost=726.45..726.70 rows=100 width=40) (actual time=35.176..35.182 rows=1 loops=1)"
"  Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"  Buffers: shared hit=30663"
"  ->  Sort  (cost=726.45..726.70 rows=103 width=40) (actual time=35.175..35.181 rows=1 loops=1)"
"        Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"        Sort Key: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec))"
"        Sort Method: quicksort  Memory: 25kB"
"        Buffers: shared hit=30663"
"        ->  Hash Join  (cost=499.03..723.00 rows=103 width=40) (actual time=34.885..35.175 rows=1 loops=1)"
"              Output: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"              Inner Unique: true"
"              Hash Cond: (news_items_embedding.id = news_items.id)"
"              Buffers: shared hit=30663"
"              ->  Seq Scan on public.news_items_embedding  (cost=0.00..215.30 rows=3207 width=22) (actual time=6.648..34.778 rows=2 loops=1)"
"                    Output: news_items_embedding.id, news_items_embedding.embedding_open_ai_large"
"                    Filter: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec) < '0.19999999999999996'::double precision)"
"                    Rows Removed by Filter: 9618"
"                    Buffers: shared hit=29981"
"              ->  Hash  (cost=495.14..495.14 rows=311 width=32) (actual time=0.375..0.377 rows=408 loops=1)"
"                    Output: news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"                    Buckets: 1024  Batches: 1  Memory Usage: 35kB"
"                    Buffers: shared hit=679"
"                    ->  Index Scan using news_items_publish_time_index on public.news_items  (cost=0.29..495.14 rows=311 width=32) (actual time=0.007..0.336 rows=408 loops=1)"
"                          Output: news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"                          Index Cond: ((news_items.publish_time >= '2024-02-12 23:01:00+00'::timestamp with time zone) AND (news_items.publish_time < '2024-02-26 23:01:00+00'::timestamp with time zone))"
"                          Filter: (news_items.verified_state >= 1)"
"                          Rows Removed by Filter: 279"
"                          Buffers: shared hit=679"
"Planning:"
"  Buffers: shared hit=15"
"Planning Time: 0.179 ms"
"Execution Time: 35.205 ms"

Below is the plan with SET ENABLE_SEQSCAN = OFF;

"Limit  (cost=994.47..994.72 rows=100 width=40) (actual time=2.256..2.258 rows=1 loops=1)"
"  Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"  Buffers: shared hit=3253"
"  ->  Sort  (cost=994.47..994.73 rows=103 width=40) (actual time=2.256..2.257 rows=1 loops=1)"
"        Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"        Sort Key: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec))"
"        Sort Method: quicksort  Memory: 25kB"
"        Buffers: shared hit=3253"
"        ->  Nested Loop  (cost=0.57..991.03 rows=103 width=40) (actual time=1.200..2.252 rows=1 loops=1)"
"              Output: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"              Inner Unique: true"
"              Buffers: shared hit=3253"
"              ->  Index Scan using news_items_publish_time_index on public.news_items  (cost=0.29..495.14 rows=311 width=32) (actual time=0.010..0.379 rows=408 loops=1)"
"                    Output: news_items.id, news_items.url, news_items.s3_image_path, news_items.s3_image_format, news_items.s3_image_width, news_items.s3_image_height, news_items.publish_time, news_items.verified_time, news_items.ai_title, news_items.ai_summary_short, news_items.verified_state, news_items.tag_time, news_items.sentiment, news_items.last_click_time, news_items.newsletter_referrals, news_items.website_referrals, news_items.popularity, news_items.source_name, news_items.tags, news_items.region_tags, news_items.reject_tags, news_items.news_industries, news_items.business_categories"
"                    Index Cond: ((news_items.publish_time >= '2024-02-12 23:01:00+00'::timestamp with time zone) AND (news_items.publish_time < '2024-02-26 23:01:00+00'::timestamp with time zone))"
"                    Filter: (news_items.verified_state >= 1)"
"                    Rows Removed by Filter: 279"
"                    Buffers: shared hit=679"
"              ->  Index Scan using news_items_embedding_pkey on public.news_items_embedding  (cost=0.29..1.59 rows=1 width=22) (actual time=0.004..0.004 rows=0 loops=408)"
"                    Output: news_items_embedding.id, news_items_embedding.embedding_open_ai_large"
"                    Index Cond: (news_items_embedding.id = news_items.id)"
"                    Filter: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec) < '0.19999999999999996'::double precision)"
"                    Rows Removed by Filter: 1"
"                    Buffers: shared hit=2571"
"Settings: enable_seqscan = 'off'"
"Planning:"
"  Buffers: shared hit=15"
"Planning Time: 0.179 ms"
"Execution Time: 2.278 ms"

After setting random_page_cost = '1.1'.

Query with where...

"Limit  (cost=595.70..595.95 rows=100 width=40) (actual time=2.701..2.702 rows=1 loops=1)"
"  Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"  Buffers: shared hit=3267"
"  ->  Sort  (cost=595.70..595.95 rows=103 width=40) (actual time=2.700..2.701 rows=1 loops=1)"
"        Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"        Sort Key: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec))"
"        Sort Method: quicksort  Memory: 25kB"
"        Buffers: shared hit=3267"
"        ->  Nested Loop  (cost=0.57..592.25 rows=103 width=40) (actual time=1.493..2.695 rows=1 loops=1)"
"              Output: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"              Inner Unique: true"
"              Buffers: shared hit=3267"
"              ->  Index Scan using news_items_publish_time_index on public.news_items  (cost=0.29..375.67 rows=310 width=32) (actual time=0.011..0.484 rows=408 loops=1)"
"                    Output: news_items.id, news_items.url, news_items.s3_image_path, news_items.s3_image_format, news_items.s3_image_width, news_items.s3_image_height, news_items.publish_time, news_items.verified_time, news_items.ai_title, news_items.ai_summary_short, news_items.verified_state, news_items.tag_time, news_items.sentiment, news_items.last_click_time, news_items.newsletter_referrals, news_items.website_referrals, news_items.popularity, news_items.source_name, news_items.tags, news_items.region_tags, news_items.reject_tags, news_items.news_industries, news_items.business_categories"
"                    Index Cond: ((news_items.publish_time >= '2024-02-12 23:01:00+00'::timestamp with time zone) AND (news_items.publish_time < '2024-02-26 23:01:00+00'::timestamp with time zone))"
"                    Filter: (news_items.verified_state >= 1)"
"                    Rows Removed by Filter: 279"
"                    Buffers: shared hit=700"
"              ->  Index Scan using news_items_embedding_pkey on public.news_items_embedding  (cost=0.29..0.70 rows=1 width=22) (actual time=0.005..0.005 rows=0 loops=408)"
"                    Output: news_items_embedding.id, news_items_embedding.embedding_open_ai_large"
"                    Index Cond: (news_items_embedding.id = news_items.id)"
"                    Filter: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec) < '0.19999999999999996'::double precision)"
"                    Rows Removed by Filter: 1"
"                    Buffers: shared hit=2564"
"Settings: random_page_cost = '1.1'"
"Planning:"
"  Buffers: shared hit=16"
"Planning Time: 0.209 ms"
"Execution Time: 2.725 ms"

Query without where...

"Limit  (cost=89.98..274.19 rows=100 width=40) (actual time=0.923..1.135 rows=1 loops=1)"
"  Output: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"  Buffers: shared hit=825"
"  ->  Nested Loop  (cost=89.98..2409.10 rows=1259 width=40) (actual time=0.922..1.133 rows=1 loops=1)"
"        Output: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec), news_items.id, news_items.verified_state, news_items.source_name, news_items.publish_time"
"        Inner Unique: true"
"        Buffers: shared hit=825"
"        ->  Index Scan using news_items_embedding_embedding_open_ai_large_idx on public.news_items_embedding  (cost=89.70..312.10 rows=3207 width=22) (actual time=0.900..1.107 rows=2 loops=1)"
"              Output: news_items_embedding.id, news_items_embedding.embedding_open_ai_large"
"              Order By: (news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec)"
"              Filter: ((news_items_embedding.embedding_open_ai_large <=> '[-0.048431396,...]'::halfvec) < '0.19999999999999996'::double precision)"
"              Rows Removed by Filter: 38"
"              Buffers: shared hit=816"
"        ->  Index Scan using news_items_pkey on public.news_items  (cost=0.29..0.65 rows=1 width=32) (actual time=0.004..0.004 rows=0 loops=2)"
"              Output: news_items.id, news_items.url, news_items.s3_image_path, news_items.s3_image_format, news_items.s3_image_width, news_items.s3_image_height, news_items.publish_time, news_items.verified_time, news_items.ai_title, news_items.ai_summary_short, news_items.verified_state, news_items.tag_time, news_items.sentiment, news_items.last_click_time, news_items.newsletter_referrals, news_items.website_referrals, news_items.popularity, news_items.source_name, news_items.tags, news_items.region_tags, news_items.reject_tags, news_items.news_industries, news_items.business_categories"
"              Index Cond: (news_items.id = news_items_embedding.id)"
"              Filter: ((news_items.publish_time >= '2024-02-12 23:01:00+00'::timestamp with time zone) AND (news_items.verified_state >= 1))"
"              Rows Removed by Filter: 0"
"              Buffers: shared hit=6"
"Settings: random_page_cost = '1.1'"
"Planning:"
"  Buffers: shared hit=16"
"Planning Time: 0.222 ms"
"Execution Time: 1.158 ms"

Upvotes: 2

Views: 222

Answers (0)

Related Questions