Khandelwal

Reputation: 661

Making a Postgres Query Faster. More Indexes?

I'm running GeoDjango/Postgres 9.1/PostGIS, and I'm trying to get the following query (and others like it) to run faster.

[the EWKB geometry literals in the query are truncated for brevity]

SELECT "crowdbreaks_incomingkeyword"."keyword_id"
       , COUNT("crowdbreaks_incomingkeyword"."keyword_id") AS "cnt" 
  FROM "crowdbreaks_incomingkeyword"
 INNER JOIN "crowdbreaks_tweet"
       ON ("crowdbreaks_incomingkeyword"."tweet_id"
          = "crowdbreaks_tweet"."tweet_id")
  LEFT OUTER JOIN "crowdbreaks_place"
    ON ("crowdbreaks_tweet"."place_id"
       = "crowdbreaks_place"."place_id") 
 WHERE (("crowdbreaks_tweet"."coordinates"
        @ ST_GeomFromEWKB(E'\\001 ... \\000\\000\\000\\0008@'::bytea)
       OR ST_Overlaps("crowdbreaks_place"."bounding_box"
                     , ST_GeomFromEWKB(E'\\001...00\\000\\0008@'::bytea)
       )) 
   AND "crowdbreaks_tweet"."created_at" > E'2012-04-17 15:46:12.109893'
   AND "crowdbreaks_tweet"."created_at" < E'2012-04-18 15:46:12.109899' ) 
 GROUP BY "crowdbreaks_incomingkeyword"."keyword_id"
     ;

Here is what the crowdbreaks_tweet table looks like:

\d+ crowdbreaks_tweet;
                       Table "public.crowdbreaks_tweet"
    Column     |           Type           | Modifiers | Storage  | Description 
---------------+--------------------------+-----------+----------+-------------
 tweet_id      | bigint                   | not null  | plain    | 
 tweeter       | bigint                   | not null  | plain    | 
 text          | text                     | not null  | extended | 
 created_at    | timestamp with time zone | not null  | plain    | 
 country_code  | character varying(3)     |           | extended | 
 place_id      | character varying(32)    |           | extended | 
 coordinates   | geometry                 |           | main     | 
Indexes:
    "crowdbreaks_tweet_pkey" PRIMARY KEY, btree (tweet_id)
    "crowdbreaks_tweet_coordinates_id" gist (coordinates)
    "crowdbreaks_tweet_created_at" btree (created_at)
    "crowdbreaks_tweet_place_id" btree (place_id)
    "crowdbreaks_tweet_place_id_like" btree (place_id varchar_pattern_ops)
Check constraints:
    "enforce_dims_coordinates" CHECK (st_ndims(coordinates) = 2)
    "enforce_geotype_coordinates" CHECK (geometrytype(coordinates) = 'POINT'::text OR coordinates IS NULL)
    "enforce_srid_coordinates" CHECK (st_srid(coordinates) = 4326)
Foreign-key constraints:
    "crowdbreaks_tweet_place_id_fkey" FOREIGN KEY (place_id) REFERENCES crowdbreaks_place(place_id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "crowdbreaks_incomingkeyword" CONSTRAINT "crowdbreaks_incomingkeyword_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "crowdbreaks_tweetanswer" CONSTRAINT "crowdbreaks_tweetanswer_tweet_id_id_fkey" FOREIGN KEY (tweet_id_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

And here is the explain analyze for the query:

 HashAggregate  (cost=184022.03..184023.18 rows=115 width=4) (actual time=6381.707..6381.769 rows=62 loops=1)
   ->  Hash Join  (cost=103857.48..183600.24 rows=84357 width=4) (actual time=1745.449..6377.505 rows=3453 loops=1)
         Hash Cond: (crowdbreaks_incomingkeyword.tweet_id = crowdbreaks_tweet.tweet_id)
         ->  Seq Scan on crowdbreaks_incomingkeyword  (cost=0.00..36873.97 rows=2252597 width=12) (actual time=0.008..2136.839 rows=2252597 loops=1)
         ->  Hash  (cost=102535.68..102535.68 rows=80544 width=8) (actual time=1744.815..1744.815 rows=3091 loops=1)
               Buckets: 4096  Batches: 4  Memory Usage: 32kB
               ->  Hash Left Join  (cost=16574.93..102535.68 rows=80544 width=8) (actual time=112.551..1740.651 rows=3091 loops=1)
                     Hash Cond: ((crowdbreaks_tweet.place_id)::text = (crowdbreaks_place.place_id)::text)
                     Filter: ((crowdbreaks_tweet.coordinates @ '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) OR ((crowdbreaks_place.bounding_box && '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) AND _st_overlaps(crowdbreaks_place.bounding_box, '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry)))
                     ->  Bitmap Heap Scan on crowdbreaks_tweet  (cost=15874.18..67060.28 rows=747873 width=125) (actual time=96.012..940.462 rows=736784 loops=1)
                           Recheck Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                            ->  Bitmap Index Scan on crowdbreaks_tweet_created_at  (cost=0.00..15687.22 rows=747873 width=0) (actual time=94.259..94.259 rows=736784 loops=1)
                                 Index Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                     ->  Hash  (cost=217.11..217.11 rows=6611 width=469) (actual time=15.926..15.926 rows=6611 loops=1)
                           Buckets: 1024  Batches: 4  Memory Usage: 259kB
                           ->  Seq Scan on crowdbreaks_place  (cost=0.00..217.11 rows=6611 width=469) (actual time=0.005..6.908 rows=6611 loops=1)
 Total runtime: 6381.903 ms
(17 rows)

That's a pretty bad runtime for the query. Ideally, I'd like to get results back in a second or two.

I've increased Postgres's shared_buffers to 2GB (the machine has 8GB of RAM), but other than that I'm not quite sure what to do. What are my options? Should I do fewer joins? Are there any other indexes I can throw on there? The sequential scan on crowdbreaks_incomingkeyword doesn't make sense to me: it's a table of foreign keys to other tables, and thus has indexes on it.
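
(For reference, the join-column index I'd expect to matter here would look something like the sketch below. The index name is illustrative, since I haven't pasted that table's definition; Django normally creates this kind of index for a ForeignKey automatically.)

-- Illustrative name; verify what's already there with \d crowdbreaks_incomingkeyword
CREATE INDEX "crowdbreaks_incomingkeyword_tweet_id"
    ON "crowdbreaks_incomingkeyword" ("tweet_id");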

Upvotes: 3

Views: 2719

Answers (1)

Erwin Brandstetter

Reputation: 658767

Judging from your comment, I would try two things:

  • Raise the statistics target for the columns involved (and run ANALYZE afterwards):

    ALTER TABLE tbl ALTER COLUMN col SET STATISTICS 1000;  -- default is 100
    ANALYZE tbl;

The data distribution may be uneven. A bigger sample may provide the query planner with more accurate estimates.

  • Play with the cost settings in postgresql.conf. Your sequential scans may need to be made more expensive relative to your index scans for the planner to arrive at good plans.

Try lowering the cost for cpu_index_tuple_cost and setting effective_cache_size to as much as three quarters of your total RAM on a dedicated DB server.
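
A minimal sketch with illustrative numbers, assuming the dedicated 8GB server described in the question (the values are assumptions to tune from, not measurements):

    # postgresql.conf -- illustrative values for a dedicated 8GB server
    effective_cache_size = 6GB        # roughly three quarters of total RAM
    cpu_index_tuple_cost = 0.001      # below the 0.005 default, making index scans cheaper

Both settings can also be tried per session before editing postgresql.conf:

    SET effective_cache_size = '6GB';
    SET cpu_index_tuple_cost = 0.001;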

Upvotes: 5
