Reputation: 610
Here's the query as it is in Rails:
User.limit(20).
where.not(id: to_skip, number_of_photos: 0).
where(age: @user.seeking_age_min..@user.seeking_age_max).
tagged_with(@user.seeking_traits, on: :trait, any: true).
tagged_with(@user.seeking_gender, on: :trait, any: true).ids
And here's the output of EXPLAIN ANALYZE. Note the id <> ALL(...) part is shortened; there are around 10K ids in it.
Limit (cost=23.32..5331.16 rows=20 width=1698) (actual time=2237.871..2243.709 rows=20 loops=1)
-> Nested Loop Semi Join (cost=23.32..875817.48 rows=3300 width=1698) (actual time=2237.870..2243.701 rows=20 loops=1)
-> Merge Semi Join (cost=22.89..857813.95 rows=8311 width=1702) (actual time=463.757..2220.691 rows=1351 loops=1)
Merge Cond: (users.id = users_trait_taggings_356a192.taggable_id)
-> Index Scan using users_pkey on users (cost=0.29..834951.51 rows=37655 width=1698) (actual time=455.122..2199.322 rows=7866 loops=1)
Index Cond: (id IS NOT NULL)
Filter: ((number_of_photos <> 0) AND (age >= 18) AND (age <= 99) AND (id <> ALL ('{7066,7065,...,15624,23254}'::integer[])))
Rows Removed by Filter: 7652
-> Index Only Scan using taggings_idx on taggings users_trait_taggings_356a192 (cost=0.42..22767.59 rows=11393 width=4) (actual time=0.048..16.009 rows=4554 loops=1)
Index Cond: ((tag_id = 2) AND (taggable_type = 'User'::text) AND (context = 'trait'::text))
Heap Fetches: 4554
-> Index Scan using index_taggings_on_taggable_id_and_taggable_type_and_context on taggings users_trait_taggings_5df4b2a (cost=0.42..2.16 rows=1 width=4) (actual time=0.016..0.016 rows=0 loops=1351)
Index Cond: ((taggable_id = users.id) AND ((taggable_type)::text = 'User'::text) AND ((context)::text = 'trait'::text))
Filter: (tag_id = ANY ('{4,6}'::integer[]))
Rows Removed by Filter: 2
Total runtime: 2243.913 ms
It seems like there's something wrong with the Index Scan using users_pkey on users step, where the index scan is taking a very long time, even though there are indexes on age, number_of_photos, and id:
add_index "users", ["age"], name: "index_users_on_age", using: :btree
add_index "users", ["number_of_photos"], name: "index_users_on_number_of_photos", using: :btree
to_skip is an array of user ids to skip. A user has many skips, and each skip has a partner_id. So to fetch to_skip I'm doing:
to_skip = @user.skips.pluck(:partner_id)
I tried to isolate the query to just:
sql = User.limit(20).
where.not(id: to_skip, number_of_photos: 0).
where(age: @user.seeking_age_min..@user.seeking_age_max).to_sql
And I'm still getting the same problem with the EXPLAIN ANALYZE. Again, the list of user ids is snipped:
Limit (cost=0.00..435.34 rows=20 width=1698) (actual time=0.219..4.844 rows=20 loops=1)
-> Seq Scan on users (cost=0.00..819629.38 rows=37655 width=1698) (actual time=0.217..4.838 rows=20 loops=1)
Filter: ((id IS NOT NULL) AND (number_of_photos <> 0) AND (age >= 18) AND (age <= 99) AND (id <> ALL ('{7066,7065,...,15624,23254}'::integer[])))
Rows Removed by Filter: 6
Total runtime: 5.044 ms
Any thoughts on how I can optimize this query in Rails + Postgres?
EDIT: Here are the relevant models:
class User < ActiveRecord::Base
  acts_as_messageable required: :body, # default [:topic, :body]
                      dependent: :destroy

  has_many :skips, :dependent => :destroy

  acts_as_taggable # Alias for acts_as_taggable_on :tags
  acts_as_taggable_on :seeking_gender, :trait, :seeking_race

  scope :by_updated_date, -> {
    order("updated_at DESC")
  }
end
# schema
create_table "users", force: :cascade do |t|
t.string "email", default: "", null: false
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.text "skips", array: true
t.integer "number_of_photos", default: 0
t.integer "age"
end
add_index "users", ["age"], name: "index_users_on_age", using: :btree
add_index "users", ["email"], name: "index_users_on_email", unique: true, using: :btree
add_index "users", ["number_of_photos"], name: "index_users_on_number_of_photos", using: :btree
add_index "users", ["updated_at"], name: "index_users_on_updated_at", order: {"updated_at"=>:desc}, using: :btree
class Skip < ActiveRecord::Base
  belongs_to :user
end

# schema
create_table "skips", force: :cascade do |t|
  t.integer  "user_id"
  t.integer  "partner_id"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
end

add_index "skips", ["partner_id"], name: "index_skips_on_partner_id", using: :btree
add_index "skips", ["user_id"], name: "index_skips_on_user_id", using: :btree
Upvotes: 0
Views: 696
Reputation: 5156
The speed issue is likely due to the long list of ids in to_skip (about 60Kb) getting passed in as an array. The solution, then, would be to rework it into the result of a subquery, so Postgres can better optimize the query.
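As a rough sanity check on that size (a console sketch, assuming to_skip is the ~10K-element array from the question):

to_skip.size                # => roughly 10_000 ids
to_skip.join(",").bytesize  # => roughly 60_000 bytes of literal id text
# All of this gets inlined into the SQL string, and Postgres then has to
# compare the full array against every candidate row: id <> ALL ('{...}').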
When building to_skip, try using select instead of pluck. pluck returns an array, which you then pass to the main query. select, in turn, returns an ActiveRecord::Relation, whose SQL can be embedded in the main query, potentially making it more efficient.
to_skip = @user.skips.select(:partner_id)
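To illustrate the difference (a sketch; the exact SQL depends on your Rails version):

@user.skips.pluck(:partner_id)
# runs SELECT "skips"."partner_id" FROM "skips" WHERE ... immediately and
# returns a plain Ruby array of ids, which then gets inlined as literals
@user.skips.select(:partner_id)
# runs nothing yet; returns an ActiveRecord::Relation that, when passed
# to where.not(id: ...), is embedded as a NOT IN (SELECT ...) subquery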
Until your model code is posted, it's hard to make more specific suggestions. The general direction I'd explore would be to merge all the relevant steps into a single query and let the database do the optimizing.
UPDATE
The Active Record query using select would look something like this (I skipped the taggable stuff, as it seemingly does not affect the performance much):
User.limit(20).
where.not(id: @user.skips.select(:partner_id), number_of_photos: 0).
where(age: 0..25)
This is the SQL query that gets executed. Note how ids to skip are fetched by a subquery:
SELECT "users".* FROM "users"
WHERE ("users"."number_of_photos" != 0)
AND ("users"."id" NOT IN (
SELECT "skips"."partner_id"
FROM "skips"
WHERE "skips"."user_id" = 1
))
AND ("users"."age" BETWEEN 0 AND 25)
LIMIT 20
Try running your query this way and see how it affects the performance.
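If it helps, you can compare plans straight from the console: relations respond to explain (plain EXPLAIN, without ANALYZE, on Rails of this era), and for real timings you can wrap the generated SQL yourself. A sketch:

relation = User.limit(20).
  where.not(id: @user.skips.select(:partner_id), number_of_photos: 0).
  where(age: @user.seeking_age_min..@user.seeking_age_max)

puts relation.explain  # EXPLAIN for the generated query
# For actual timings, feed the generated SQL to EXPLAIN ANALYZE directly:
result = ActiveRecord::Base.connection.execute("EXPLAIN ANALYZE #{relation.to_sql}")
puts result.values.join("\n")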
Upvotes: 2