Reputation: 31
Currently working:
SELECT
    ll."BeaconId",
    a."Name",
    ll."Timestamp",
    r."Mean",
    r."AbilityId"
FROM "LastLocations" AS ll
LEFT JOIN "Readings" AS r
    ON r."BeaconId" = ll."BeaconId"
LEFT JOIN "BeaconAssignments" AS ba
    ON ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a
    ON a."Id" = ba."AssetId"
WHERE
    ll."MapId" = 200 and r."AbilityId" = 3 OR r."AbilityId" = 4
FETCH FIRST 300000 ROWS ONLY
Tried:
SELECT
    COUNT(ll."BeaconId") as "BeaconId",
    a."Name",
    DATE_TRUNC('hour', ll."Timestamp") as "Timestamp",
    AVG(r."Mean") as "Mean",
    r."AbilityId"
FROM "LastLocations" AS ll
LEFT JOIN "Readings" AS r
    ON r."BeaconId" = ll."BeaconId"
LEFT JOIN "BeaconAssignments" AS ba
    ON ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a
    ON a."Id" = ba."AssetId"
WHERE
    ll."MapId" = 200 AND r."AbilityId" = 3 OR r."AbilityId" = 4
GROUP BY ll."Timestamp", r."AbilityId", a."Name"
I am new to SQL in general, and I have a table that is 740 million rows long. The first query successfully loads the data into a pandas DataFrame for me. However, I currently limit the fetch to 300,000 rows, since it takes far too long otherwise.
What I am trying to do in the second query is essentially group the data by hour. That would drastically reduce the number of rows in SQL before the result is sent over to pandas, so I would not have to lose data by limiting the row count the way I do in the first query.
Does anyone have suggestions on how to make this more efficient, or a different approach I should take?
EXPLAIN (ANALYZE, BUFFERS) output. I used Laurenz Albe's improved query, but limited it to 300,000 rows for the sake of actually getting a result:
Limit (cost=21204954.90..21305131.45 rows=300000 width=42) (actual time=400750.745..423438.464 rows=472 loops=1)
  Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
  -> Finalize GroupAggregate (cost=21204954.90..24020556.97 rows=8431920 width=42) (actual time=400750.744..423438.203 rows=472 loops=1)
        Group Key: (date_trunc('hour'::text, ll."Timestamp")), r."AbilityId", a."Name"
        Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
        -> Gather Merge (cost=21204954.90..23683280.17 rows=16863840 width=66) (actual time=400724.591..423436.392 rows=1411 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
              -> Partial GroupAggregate (cost=21203954.88..21735774.58 rows=8431920 width=66) (actual time=385879.606..405937.192 rows=470 loops=3)
                    Group Key: (date_trunc('hour'::text, ll."Timestamp")), r."AbilityId", a."Name"
                    Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
                    -> Sort (cost=21203954.88..21275024.99 rows=28428047 width=38) (actual time=385869.894..397691.644 rows=28912575 loops=3)
                          Sort Key: (date_trunc('hour'::text, ll."Timestamp")), r."AbilityId", a."Name"
                          Sort Method: external merge Disk: 1446864kB
                          Worker 0: Sort Method: external merge Disk: 1465512kB
                          Worker 1: Sort Method: external merge Disk: 1416352kB
                          Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
                          -> Hash Join (cost=2005.96..15352449.79 rows=28428047 width=38) (actual time=426.777..231374.850 rows=28912575 loops=3)
                                Hash Cond: (r."BeaconId" = ll."BeaconId")
                                Buffers: shared hit=1300485 read=9401083
                                -> Parallel Seq Scan on "Readings" r (cost=0.00..14601996.04 rows=104786066 width=16) (actual time=0.020..173893.042 rows=84998306 loops=3)
                                      Filter: ("AbilityId" = ANY ('{3,4}'::integer[]))
                                      Rows Removed by Filter: 164491074
                                      Buffers: shared hit=1299285 read=9401083
                                -> Hash (cost=1996.38..1996.38 rows=766 width=26) (actual time=4.287..4.292 rows=867 loops=3)
                                      Buckets: 1024 Batches: 1 Memory Usage: 59kB
                                      Buffers: shared hit=1184
                                      -> Hash Right Join (cost=1870.44..1996.38 rows=766 width=26) (actual time=2.603..4.027 rows=867 loops=3)
                                            Hash Cond: (a."Id" = ba."AssetId")
                                            Buffers: shared hit=1184
                                            -> Seq Scan on "Assets" a (cost=0.00..82.14 rows=3614 width=18) (actual time=0.007..0.533 rows=3633 loops=3)
                                                  Buffers: shared hit=138
                                            -> Hash (cost=1860.87..1860.87 rows=766 width=16) (actual time=2.570..2.573 rows=867 loops=3)
                                                  Buckets: 1024 Batches: 1 Memory Usage: 52kB
                                                  Buffers: shared hit=1046
                                                  -> Hash Right Join (cost=1792.95..1860.87 rows=766 width=16) (actual time=1.454..2.332 rows=867 loops=3)
                                                        Hash Cond: (ba."BeaconId" = ll."BeaconId")
                                                        Buffers: shared hit=1046
                                                        -> Seq Scan on "BeaconAssignments" ba (cost=0.00..58.78 rows=3478 width=8) (actual time=0.011..0.469 rows=3396 loops=3)
                                                              Buffers: shared hit=72
                                                        -> Hash (cost=1783.76..1783.76 rows=735 width=12) (actual time=0.955..0.956 rows=735 loops=3)
                                                              Buckets: 1024 Batches: 1 Memory Usage: 43kB
                                                              Buffers: shared hit=974
                                                              -> Bitmap Heap Scan on "LastLocations" ll (cost=293.98..1783.76 rows=735 width=12) (actual time=0.160..0.739 rows=735 loops=3)
                                                                    Recheck Cond: ("MapId" = 200)
                                                                    Heap Blocks: exact=315
                                                                    Buffers: shared hit=974
                                                                    -> Bitmap Index Scan on "IX_LastLocations_MapId" (cost=0.00..293.79 rows=735 width=0) (actual time=0.112..0.112 rows=735 loops=3)
                                                                          Index Cond: ("MapId" = 200)
                                                                          Buffers: shared hit=29
Planning Time: 0.761 ms
Execution Time: 423664.670 ms
Upvotes: 0
Views: 50
Reputation: 246298
Everything in the SELECT list that is not in an aggregate function must also be in the GROUP BY list. Also, you cannot use aliases in the GROUP BY clause.
You should avoid the OR for better performance.
Do you have an index on "LastLocations"."MapId"? Depending on how your query is executed, an index on "Readings"."AbilityId" might also help.
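If either of them is missing, the indexes could be created roughly like this (the index names below are only placeholders, and IF NOT EXISTS skips any that already exist):
CREATE INDEX IF NOT EXISTS idx_lastlocations_mapid ON "LastLocations" ("MapId");
CREATE INDEX IF NOT EXISTS idx_readings_abilityid ON "Readings" ("AbilityId");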
If you can, use inner joins rather than outer joins.
This is an attempt at fixing your query:
SELECT
    COUNT(ll."BeaconId") as "BeaconId",
    a."Name",
    DATE_TRUNC('hour', ll."Timestamp") as "Timestamp",
    AVG(r."Mean") as "Mean",
    r."AbilityId"
FROM "LastLocations" AS ll
JOIN "Readings" AS r
    ON r."BeaconId" = ll."BeaconId"
LEFT JOIN "BeaconAssignments" AS ba
    ON ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a
    ON a."Id" = ba."AssetId"
WHERE
    ll."MapId" = 200 AND r."AbilityId" IN (3, 4)
GROUP BY DATE_TRUNC('hour', ll."Timestamp"),
    r."AbilityId",
    a."Name";
Reading your execution plan, you could try an index on "Readings"."BeaconId". Other than that, you could only increase work_mem (to speed up the sort) or get faster I/O.
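As a rough sketch (the index name is a placeholder, and the work_mem value is only an example to experiment with):
CREATE INDEX IF NOT EXISTS idx_readings_beaconid ON "Readings" ("BeaconId");
SET work_mem = '256MB';  -- per session; pick a value your server's RAM can actually afford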
Storing the data in a pre-aggregated form would help a lot.
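For example, a materialized view is one way to keep an hourly pre-aggregate around. This is only a sketch; the view name and the choice of grouping columns are assumptions you would adapt to whatever you need downstream:
CREATE MATERIALIZED VIEW "HourlyReadings" AS
SELECT
    ll."MapId",
    ll."BeaconId",
    r."AbilityId",
    DATE_TRUNC('hour', ll."Timestamp") AS "Hour",
    AVG(r."Mean") AS "Mean",
    COUNT(*) AS "ReadingCount"
FROM "LastLocations" AS ll
JOIN "Readings" AS r
    ON r."BeaconId" = ll."BeaconId"
GROUP BY ll."MapId", ll."BeaconId", r."AbilityId", DATE_TRUNC('hour', ll."Timestamp");
-- Refresh on whatever schedule matches how often the data changes:
REFRESH MATERIALIZED VIEW "HourlyReadings";
Your per-map query could then read from this view and join "BeaconAssignments" and "Assets" as before, touching far fewer rows.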
Upvotes: 1