Josh Voytek

Reputation: 31

How to optimize this PostgreSQL query?

Currently working:

SELECT
    ll."BeaconId" ,
    a."Name" ,
    ll."Timestamp" ,
    r."Mean",
    r."AbilityId" 
FROM "LastLocations" AS ll
LEFT JOIN "Readings" AS r ON
    r."BeaconId" = ll."BeaconId" 
LEFT JOIN "BeaconAssignments" AS ba ON
    ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a ON
    a."Id" = ba."AssetId"
WHERE
    ll."MapId" = 200 and r."AbilityId" = 3 OR r."AbilityId" = 4
FETCH FIRST 300000 ROWS ONLY

Tried:

SELECT
    COUNT(ll."BeaconId") as "BeaconId",
    a."Name" ,
    DATE_TRUNC('hour', ll."Timestamp") as "Timestamp",
    AVG(r."Mean") as "Mean",
    r."AbilityId"
FROM "LastLocations" AS ll
LEFT JOIN "Readings" AS r ON
    r."BeaconId" = ll."BeaconId" 
LEFT JOIN "BeaconAssignments" AS ba ON
    ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a ON
    a."Id" = ba."AssetId"
WHERE
    ll."MapId" = 200 AND r."AbilityId" = 3 OR r."AbilityId" = 4
GROUP BY ll."Timestamp", r."AbilityId", a."Name"

I am new to SQL in general, and I have a table that is 740 million rows long. The first query successfully loads the data into a pandas DataFrame for me. However, I currently limit the number of rows fetched to 300,000 because it takes far too long otherwise.

What I am trying to do in the second query is essentially group the data by hour. This would drastically reduce the number of rows in SQL before sending them over to pandas, so I don't have to lose data by limiting the row count the way I do in the first query.

I'm curious whether anyone has ideas on how to make this more efficient, or a different approach to take.


EXPLAIN (ANALYZE, BUFFERS) output. I used Laurenz Albe's improved query, but limited it to 300,000 rows for the sake of actually getting a result:

Limit  (cost=21204954.90..21305131.45 rows=300000 width=42) (actual time=400750.745..423438.464 rows=472 loops=1)
  Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
  ->  Finalize GroupAggregate  (cost=21204954.90..24020556.97 rows=8431920 width=42) (actual time=400750.744..423438.203 rows=472 loops=1)
        Group Key: (date_trunc('hour'::text, ll."Timestamp")), r."AbilityId", a."Name"
        Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
        ->  Gather Merge  (cost=21204954.90..23683280.17 rows=16863840 width=66) (actual time=400724.591..423436.392 rows=1411 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
              ->  Partial GroupAggregate  (cost=21203954.88..21735774.58 rows=8431920 width=66) (actual time=385879.606..405937.192 rows=470 loops=3)
                    Group Key: (date_trunc('hour'::text, ll."Timestamp")), r."AbilityId", a."Name"
                    Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
                    ->  Sort  (cost=21203954.88..21275024.99 rows=28428047 width=38) (actual time=385869.894..397691.644 rows=28912575 loops=3)
                          Sort Key: (date_trunc('hour'::text, ll."Timestamp")), r."AbilityId", a."Name"
                          Sort Method: external merge  Disk: 1446864kB
                          Worker 0:  Sort Method: external merge  Disk: 1465512kB
                          Worker 1:  Sort Method: external merge  Disk: 1416352kB
                          Buffers: shared hit=1300529 read=9401083, temp read=2048429 written=2050673
                          ->  Hash Join  (cost=2005.96..15352449.79 rows=28428047 width=38) (actual time=426.777..231374.850 rows=28912575 loops=3)
                                Hash Cond: (r."BeaconId" = ll."BeaconId")
                                Buffers: shared hit=1300485 read=9401083
                                ->  Parallel Seq Scan on "Readings" r  (cost=0.00..14601996.04 rows=104786066 width=16) (actual time=0.020..173893.042 rows=84998306 loops=3)
                                      Filter: ("AbilityId" = ANY ('{3,4}'::integer[]))
                                      Rows Removed by Filter: 164491074
                                      Buffers: shared hit=1299285 read=9401083
                                ->  Hash  (cost=1996.38..1996.38 rows=766 width=26) (actual time=4.287..4.292 rows=867 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 59kB
                                      Buffers: shared hit=1184
                                      ->  Hash Right Join  (cost=1870.44..1996.38 rows=766 width=26) (actual time=2.603..4.027 rows=867 loops=3)
                                            Hash Cond: (a."Id" = ba."AssetId")
                                            Buffers: shared hit=1184
                                            ->  Seq Scan on "Assets" a  (cost=0.00..82.14 rows=3614 width=18) (actual time=0.007..0.533 rows=3633 loops=3)
                                                  Buffers: shared hit=138
                                            ->  Hash  (cost=1860.87..1860.87 rows=766 width=16) (actual time=2.570..2.573 rows=867 loops=3)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 52kB
                                                  Buffers: shared hit=1046
                                                  ->  Hash Right Join  (cost=1792.95..1860.87 rows=766 width=16) (actual time=1.454..2.332 rows=867 loops=3)
                                                        Hash Cond: (ba."BeaconId" = ll."BeaconId")
                                                        Buffers: shared hit=1046
                                                        ->  Seq Scan on "BeaconAssignments" ba  (cost=0.00..58.78 rows=3478 width=8) (actual time=0.011..0.469 rows=3396 loops=3)
                                                              Buffers: shared hit=72
                                                        ->  Hash  (cost=1783.76..1783.76 rows=735 width=12) (actual time=0.955..0.956 rows=735 loops=3)
                                                              Buckets: 1024  Batches: 1  Memory Usage: 43kB
                                                              Buffers: shared hit=974
                                                              ->  Bitmap Heap Scan on "LastLocations" ll  (cost=293.98..1783.76 rows=735 width=12) (actual time=0.160..0.739 rows=735 loops=3)
                                                                    Recheck Cond: ("MapId" = 200)
                                                                    Heap Blocks: exact=315
                                                                    Buffers: shared hit=974
                                                                    ->  Bitmap Index Scan on "IX_LastLocations_MapId"  (cost=0.00..293.79 rows=735 width=0) (actual time=0.112..0.112 rows=735 loops=3)
                                                                          Index Cond: ("MapId" = 200)
                                                                          Buffers: shared hit=29
Planning Time: 0.761 ms
Execution Time: 423664.670 ms

Upvotes: 0

Views: 50

Answers (1)

Laurenz Albe

Reputation: 246298

  • Everything in the SELECT list that is not in an aggregate function must also be in the GROUP BY list. Also, you cannot use aliases in the GROUP BY clause.

  • You should avoid the OR for better performance.

  • Do you have an index on "LastLocations"."MapId"? Depending on how your query is executed, an index on "Readings"."AbilityId" might also help (see the index sketch after this list).

  • If you can, use inner joins rather than outer joins.
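
A minimal sketch of the index suggestions above, assuming such indexes don't already exist (the index names are illustrative):

-- Supports the filter on "LastLocations"."MapId":
CREATE INDEX IF NOT EXISTS "IX_LastLocations_MapId" ON "LastLocations" ("MapId");

-- May help if the planner filters "Readings" on "AbilityId" before joining:
CREATE INDEX IF NOT EXISTS "IX_Readings_AbilityId" ON "Readings" ("AbilityId");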

This is an attempt at fixing your query:

SELECT
    COUNT(ll."BeaconId") as "BeaconId",
    a."Name" ,
    DATE_TRUNC('hour', ll."Timestamp") as "Timestamp",
    AVG(r."Mean") as "Mean",
    r."AbilityId"
FROM "LastLocations" AS ll
JOIN "Readings" AS r ON
    r."BeaconId" = ll."BeaconId" 
LEFT JOIN "BeaconAssignments" AS ba ON
    ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a ON
    a."Id" = ba."AssetId"
WHERE
    ll."MapId" = 200 AND r."AbilityId" IN (3, 4)
GROUP BY DATE_TRUNC('hour', ll."Timestamp"),
         r."AbilityId",
         a."Name";

Reading your execution plan, you could try an index on "Readings"."BeaconId". Other than that, you could only increase work_mem (to speed up the sort) or get faster I/O.
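
A hedged sketch of both suggestions (the index name and the work_mem value are illustrative; pick a value your server's memory can actually afford):

-- Index on the join key, so the planner can fetch matching "Readings" rows per beacon
-- instead of scanning the whole table:
CREATE INDEX IF NOT EXISTS "IX_Readings_BeaconId" ON "Readings" ("BeaconId");

-- The plan shows each process spilling roughly 1.4 GB to disk for the sort;
-- more work_mem lets more of that happen in memory (session-level setting):
SET work_mem = '256MB';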

Storing the data in a pre-aggregated form would help a lot.
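
For example, a minimal sketch of pre-aggregation with a materialized view, assuming the hourly rollup from the question is what you need downstream (the view and column names are illustrative):

-- Hourly rollup over all maps and abilities; filter by "MapId"/"AbilityId" at query time.
CREATE MATERIALIZED VIEW "HourlyReadings" AS
SELECT
    ll."MapId",
    r."AbilityId",
    a."Name",
    DATE_TRUNC('hour', ll."Timestamp") AS "Hour",
    COUNT(ll."BeaconId")               AS "BeaconCount",
    AVG(r."Mean")                      AS "Mean"
FROM "LastLocations" AS ll
JOIN "Readings" AS r ON r."BeaconId" = ll."BeaconId"
LEFT JOIN "BeaconAssignments" AS ba ON ba."BeaconId" = ll."BeaconId"
LEFT JOIN "Assets" AS a ON a."Id" = ba."AssetId"
GROUP BY ll."MapId", r."AbilityId", a."Name", DATE_TRUNC('hour', ll."Timestamp");

-- Refresh on whatever schedule matches how often new readings arrive:
REFRESH MATERIALIZED VIEW "HourlyReadings";

Downstream queries then read the small pre-aggregated view instead of joining the raw tables every time.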

Upvotes: 1
