Reputation: 119
So I have a DB on Aurora r5.2xl and I created a duplicate DB on an i3.2xl (tons of IOPS). This one query for some reason takes 10x longer on the new server than on Aurora, even though all the configuration variables are set exactly the same.
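For what it's worth, "set exactly the same" here means the non-default GUCs match on both servers; a quick way to check that is to dump them on each side and diff the output:
-- list every setting that is not at its default value; run on both servers and diff
SELECT name, setting
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;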
Here is the query:
explain analyze SELECT a.id,
count(b.id) as hotel_count,
array_agg(b.id) as hotel_ids
FROM autocomplete a,
hotel b
WHERE a.type = 'city'
AND st_covers(a.poly, b.coords)
AND st_distance(a.coords, b.coords) < 40000
AND b.last_seen IS NOT NULL
AND b.enabled = true
GROUP BY a.id
limit 10;
Here are the Aurora results:
Limit  (cost=0.83..1100.64 rows=10 width=56) (actual time=0.522..2.534 rows=10 loops=1)
  ->  GroupAggregate  (cost=0.83..3696781.84 rows=33613 width=56) (actual time=0.521..2.531 rows=10 loops=1)
        Group Key: a.id
        ->  Nested Loop  (cost=0.83..3668393.17 rows=3729135 width=32) (actual time=0.357..2.506 rows=69 loops=1)
              ->  Index Scan using "PK_5523204bb8469c2025bcb0b55bc" on autocomplete a  (cost=0.42..188879.01 rows=33613 width=176) (actual time=0.025..0.280 rows=14 loops=1)
                    Filter: (type = 'city'::autocomplete_type_enum)
                    Rows Removed by Filter: 133
              ->  Index Scan using hotel_coords_idx on hotel b  (cost=0.41..103.49 rows=3 width=48) (actual time=0.109..0.158 rows=5 loops=14)
                    Index Cond: (a.poly && coords)
                    Filter: ((last_seen IS NOT NULL) AND enabled AND _st_covers(a.poly, coords) AND (_st_distance(a.coords, coords, '0'::double precision, true) < '40000'::double precision))
                    Rows Removed by Filter: 7
Planning time: 26.210 ms
Execution time: 2.590 ms
And here are the new server's results (there's no iowait, but CPU burns at 100% the whole time if I set the limit higher):
Limit  (cost=5215.44..28517.08 rows=10 width=56) (actual time=124.928..484.664 rows=10 loops=1)
  ->  GroupAggregate  (cost=5215.44..80034722.29 rows=34345 width=56) (actual time=124.926..484.643 rows=10 loops=1)
        Group Key: a.id
        ->  Gather Merge  (cost=5215.44..80000159.03 rows=4551193 width=32) (actual time=89.883..559.222 rows=69 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Nested Loop  (cost=4215.41..79473838.41 rows=1896330 width=32) (actual time=38.777..468.633 rows=162 loops=3)
                    ->  Parallel Index Scan using "PK_5523204bb8469c2025bcb0b55bc" on autocomplete a  (cost=0.42..92511.37 rows=14310 width=176) (actual time=0.017..0.159 rows=12 loops=3)
                          Filter: (type = 'city'::autocomplete_type_enum)
                          Rows Removed by Filter: 84
                    ->  Bitmap Heap Scan on hotel b  (cost=4214.99..5547.18 rows=8 width=48) (actual time=37.861..37.959 rows=13 loops=37)
                          Recheck Cond: (last_seen IS NOT NULL)
                          Rows Removed by Index Recheck: 11
                          Filter: (enabled AND st_covers(a.poly, coords) AND (st_distance(a.coords, coords, true) < '40000'::double precision))
                          Rows Removed by Filter: 1
                          Heap Blocks: exact=107
                          ->  BitmapAnd  (cost=4214.99..4214.99 rows=26 width=0) (actual time=37.759..37.759 rows=0 loops=37)
                                ->  Bitmap Index Scan on hotel_coords_idx  (cost=0.00..2.60 rows=117 width=0) (actual time=0.910..0.910 rows=59 loops=37)
                                      Index Cond: (coords && a.poly)
                                ->  Bitmap Index Scan on hotel_last_seen_idx  (cost=0.00..4149.68 rows=258652 width=0) (actual time=36.268..36.268 rows=262908 loops=37)
Planning Time: 1.815 ms
Execution Time: 559.530 ms
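The big difference I can see is that BitmapAnd pulling in hotel_last_seen_idx (~262k rows) on every one of the 37 loops. As a session-level experiment (not a fix), this is how I'd check whether that bitmap plan is what's burning the CPU:
-- experiment only: steer the planner away from the bitmap plan for this session,
-- so it can probe hotel_coords_idx directly the way the Aurora plan does
SET enable_bitmapscan = off;
EXPLAIN ANALYZE
SELECT a.id, count(b.id) AS hotel_count, array_agg(b.id) AS hotel_ids
FROM autocomplete a, hotel b
WHERE a.type = 'city'
  AND st_covers(a.poly, b.coords)
  AND st_distance(a.coords, b.coords) < 40000
  AND b.last_seen IS NOT NULL
  AND b.enabled = true
GROUP BY a.id
LIMIT 10;
RESET enable_bitmapscan;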
Any idea why this would be the case? I'm at a loss. The new server actually beats Aurora on a bunch of queries, including ones like "explain analyze select name from hotel where last_seen is not null and enabled = true group by name limit 1000;", but for some reason the query above just runs so much slower.
Upvotes: 2
Views: 184
Reputation: 119
OK, so I found out that PG12 was just optimizing that query really poorly, no matter what I tried. When I rewrote it like this, it ran just as fast:
WITH x AS
(
SELECT id, poly, coords
FROM autocomplete
WHERE type = 'city'
)
SELECT x.id,
sum(CASE WHEN last_seen IS NOT NULL THEN 1 ELSE 0 END) AS hotel_count,
array_agg(CASE WHEN last_seen IS NOT NULL THEN b.id ELSE NULL END) AS hotel_ids
FROM x, hotel b
WHERE st_covers(x.poly, b.coords)
AND st_distance(x.coords, b.coords) < 40000
AND b.enabled = true
GROUP BY x.id;
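One note on the CTE: PG12 inlines a single-reference CTE like this by default, so if the fast plan turns out to depend on materializing x first, the old optimization-fence behaviour can be pinned with the MATERIALIZED keyword (same query, one extra word):
WITH x AS MATERIALIZED
(
SELECT id, poly, coords
FROM autocomplete
WHERE type = 'city'
)
SELECT x.id,
sum(CASE WHEN last_seen IS NOT NULL THEN 1 ELSE 0 END) AS hotel_count,
array_agg(CASE WHEN last_seen IS NOT NULL THEN b.id ELSE NULL END) AS hotel_ids
FROM x, hotel b
WHERE st_covers(x.poly, b.coords)
AND st_distance(x.coords, b.coords) < 40000
AND b.enabled = true
GROUP BY x.id;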
PG12 was also doing a full index scan on last_seen on every loop, which made it take forever, so for some reason it was a LOT cheaper to remove last_seen from the WHERE clause and just fold it into the count condition.
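Another option that might let the original query keep a single index probe per city (untested here, and the index name is made up) is a partial GiST index on coords that bakes the last_seen/enabled filters in, so the planner never reaches for hotel_last_seen_idx at all:
-- hypothetical partial index; assumes hotel_coords_idx is a PostGIS GiST index on coords
CREATE INDEX hotel_coords_active_idx
ON hotel USING gist (coords)
WHERE last_seen IS NOT NULL AND enabled;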
Upvotes: 2