jeyongOh
jeyongOh

Reputation: 89

Why casting to geography is faster when there are big IN condition?

I'm using AWS aurora postgresql compatible. Postgresql version is 11.7 and postgis version is 2.5

I have vehicle and vehicle_current_status table.

The vehicle table has almost 4000 rows.

The vehicle table's id column is auto-increment primary key.

The vehicle_current_status table has one-to-one relation with vehicle table.

The vehicle_current_status table's id column is auto-increment primary key.

The vehicle_current_status table's coordinate column is geometry with SRID 4326. I did not use index on coordinate column because updating coordinate query is executed a lot.

There are big IN condition with number of 2845 entries.

Query 1 (without type cast)

SELECT "v"."id" AS "v_id"
FROM "vehicle" "v"
LEFT JOIN "vehicle_current_status" "vs" ON "vs"."vehicle_id" = "v"."id"
WHERE
    ST_DWITHIN(
        "vs"."coordinate",
        ST_SETSRID(
            ST_GEOMFROMGEOJSON('{"type": "Point", "coordinates": [127.03,37.509]}'),
            4326),
        0.017)
    AND "v"."id" IN (VALUES(1023),(1006),(3674),(1692)... 2845 entries)
    AND "v".IS_ACTIVE IS TRUE
    AND "vs".BATTERY_PERCENTAGE > 30

Query 1 explain

"Nested Loop Semi Join  (cost=0.28..12330.99 rows=2 width=4) (actual time=1.118..83.764 rows=121 loops=1)"
"  Join Filter: (vs.vehicle_id = ""*VALUES*"".column1)"
"  Rows Removed by Join Filter: 578765"
"  Buffers: shared hit=11846"
"  ->  Nested Loop  (cost=0.28..12160.29 rows=3 width=8) (actual time=0.028..9.577 rows=250 loops=1)"
"        Buffers: shared hit=11846"
"        ->  Seq Scan on vehicle_current_status vs  (cost=0.00..12135.39 rows=3 width=4) (actual time=0.017..8.799 rows=250 loops=1)"
"              Filter: ((coordinate && '0103000020E6100000010000000500000046B6F3FDD4C05F40E5D022DBF9BE424046B6F3FDD4C05F4017D9CEF753C342405EBA490C02C35F4017D9CEF753C342405EBA490C02C35F40E5D022DBF9BE424046B6F3FDD4C05F40E5D022DBF9BE4240'::geometry) AND (battery_percentage > 30) AND ('0101000020E610000052B81E85EBC15F40FED478E926C14240'::geometry && st_expand(coordinate, '0.017'::double precision)) AND _st_dwithin(coordinate, '0101000020E610000052B81E85EBC15F40FED478E926C14240'::geometry, '0.017'::double precision))"
"              Rows Removed by Filter: 3607"
"              Buffers: shared hit=11094"
"        ->  Index Scan using ""PK_187fa17ba39d367e5604b3d1ec9"" on vehicle v  (cost=0.28..8.30 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=250)"
"              Index Cond: (id = vs.vehicle_id)"
"              Filter: (is_active IS TRUE)"
"              Buffers: shared hit=752"
"  ->  Materialize  (cost=0.00..49.79 rows=2845 width=4) (actual time=0.000..0.131 rows=2316 loops=250)"
"        ->  Values Scan on ""*VALUES*""  (cost=0.00..35.56 rows=2845 width=4) (actual time=0.001..0.533 rows=2845 loops=1)"
"Planning Time: 2.045 ms"
"Execution Time: 83.853 ms"

Query 2 (with type cast to geography)

SELECT "v"."id" AS "v_id"
FROM "vehicle" "v"
LEFT JOIN "vehicle_current_status" "vs" ON "vs"."vehicle_id" = "v"."id"
WHERE
    ST_DWITHIN(
        "vs"."coordinate"::geography,
        ST_SETSRID(
            ST_GEOMFROMGEOJSON('{"type": "Point", "coordinates": [127.03,37.509]}'),
            4326)::geography,
        1800, false)
    AND "v"."id" IN (VALUES(1023),(1006),(3674),(1692)... 2845 entries)
    AND "v".IS_ACTIVE IS TRUE
    AND "vs".BATTERY_PERCENTAGE > 30

Query 2 explain

"Nested Loop  (cost=106.97..12760.97 rows=35 width=4) (actual time=1.988..13.254 rows=123 loops=1)"
"  Join Filter: (vs.vehicle_id = v.id)"
"  Buffers: shared hit=11466"
"  ->  Hash Join  (cost=106.69..12744.01 rows=35 width=8) (actual time=1.977..12.937 rows=123 loops=1)"
"        Hash Cond: (vs.vehicle_id = ""*VALUES*"".column1)"
"        Buffers: shared hit=11097"
"        ->  Seq Scan on vehicle_current_status vs  (cost=0.00..12636.80 rows=47 width=4) (actual time=0.145..11.040 rows=253 loops=1)"
"              Filter: ((battery_percentage > 30) AND ((coordinate)::geography && '0101000020E610000052B81E85EBC15F40FED478E926C14240'::geography) AND ('0101000020E610000052B81E85EBC15F40FED478E926C14240'::geography && _st_expand((coordinate)::geography, '1800'::double precision)) AND _st_dwithin((coordinate)::geography, '0101000020E610000052B81E85EBC15F40FED478E926C14240'::geography, '1800'::double precision, true))"
"              Rows Removed by Filter: 3604"
"              Buffers: shared hit=11097"
"        ->  Hash  (cost=71.12..71.12 rows=2845 width=4) (actual time=1.809..1.809 rows=2845 loops=1)"
"              Buckets: 4096  Batches: 1  Memory Usage: 133kB"
"              ->  HashAggregate  (cost=42.67..71.12 rows=2845 width=4) (actual time=1.071..1.392 rows=2845 loops=1)"
"                    Group Key: ""*VALUES*"".column1"
"                    ->  Values Scan on ""*VALUES*""  (cost=0.00..35.56 rows=2845 width=4) (actual time=0.001..0.532 rows=2845 loops=1)"
"  ->  Index Scan using ""PK_187fa17ba39d367e5604b3d1ec9"" on vehicle v  (cost=0.28..0.47 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=123)"
"        Index Cond: (id = ""*VALUES*"".column1)"
"        Filter: (is_active IS TRUE)"
"        Buffers: shared hit=369"
"Planning Time: 2.274 ms"
"Execution Time: 13.380 ms"

This is strange. Why casting to geography is faster?

If I remove big IN condition "v"."id" IN (VALUES...), then Query 1 is faster then Query 2.

Query 1 explain (without type cast, remove big IN condition)

"Nested Loop  (cost=0.28..12531.73 rows=4 width=4) (actual time=0.023..9.378 rows=250 loops=1)"
"  Buffers: shared hit=11846"
"  ->  Seq Scan on vehicle_current_status vs  (cost=0.00..12498.54 rows=4 width=4) (actual time=0.013..8.744 rows=250 loops=1)"
"        Filter: ((coordinate && '0103000020E6100000010000000500000046B6F3FDD4C05F40E5D022DBF9BE424046B6F3FDD4C05F4017D9CEF753C342405EBA490C02C35F4017D9CEF753C342405EBA490C02C35F40E5D022DBF9BE424046B6F3FDD4C05F40E5D022DBF9BE4240'::geometry) AND (battery_percentage > 30) AND ('0101000020E610000052B81E85EBC15F40FED478E926C14240'::geometry && st_expand(coordinate, '0.017'::double precision)) AND _st_dwithin(coordinate, '0101000020E610000052B81E85EBC15F40FED478E926C14240'::geometry, '0.017'::double precision))"
"        Rows Removed by Filter: 3607"
"        Buffers: shared hit=11094"
"  ->  Index Scan using ""PK_187fa17ba39d367e5604b3d1ec9"" on vehicle v  (cost=0.28..8.30 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=250)"
"        Index Cond: (id = vs.vehicle_id)"
"        Filter: (is_active IS TRUE)"
"        Buffers: shared hit=752"
"Planning Time: 0.347 ms"
"Execution Time: 9.415 ms"

Query 2 explain (with type cast to geography, remove big IN condition)

"Nested Loop  (cost=0.28..12886.79 rows=47 width=4) (actual time=0.122..13.833 rows=253 loops=1)"
"  Buffers: shared hit=11858"
"  ->  Seq Scan on vehicle_current_status vs  (cost=0.00..12636.80 rows=47 width=4) (actual time=0.114..13.037 rows=253 loops=1)"
"        Filter: ((battery_percentage > 30) AND ((coordinate)::geography && '0101000020E610000052B81E85EBC15F40FED478E926C14240'::geography) AND ('0101000020E610000052B81E85EBC15F40FED478E926C14240'::geography && _st_expand((coordinate)::geography, '1800'::double precision)) AND _st_dwithin((coordinate)::geography, '0101000020E610000052B81E85EBC15F40FED478E926C14240'::geography, '1800'::double precision, true))"
"        Rows Removed by Filter: 3604"
"        Buffers: shared hit=11097"
"  ->  Index Scan using ""PK_187fa17ba39d367e5604b3d1ec9"" on vehicle v  (cost=0.28..5.32 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=253)"
"        Index Cond: (id = vs.vehicle_id)"
"        Filter: (is_active IS TRUE)"
"        Buffers: shared hit=761"
"Planning Time: 0.348 ms"
"Execution Time: 13.880 ms"

Why casting to geography is faster when there are big IN condition?

Upvotes: 0

Views: 75

Answers (1)

jjanes
jjanes

Reputation: 44373

When it thinks it only has to search the in-list 3 times, it didn't seem worthwhile preprocessing it into a hash table. That turns out to be a mistake, as it actually needs to search it 250 times.

When you cast it, it then thinks it will have to search the in-list 47 times. Which while still wrong is much closer to reality and leads to a better plan.

Why does casting give a different row estimate? No idea. Maybe geometry versus geography? If you want to look into that you should simplify the query to get rid of the join and the criterion on battery_percentage to focus just on the spatial aspect.

Upvotes: 0

Related Questions