Reputation: 56689
Here are three consecutive queries with their Benchmark performance:
ids = @Company.projects.submitted.uniq.collect(&:person_id)
1.370000 0.060000 1.430000 ( 3.763946)
@persons = Person.where("id IN (?)", ids)
0.030000 0.000000 0.030000 ( 0.332878)
@emails = @persons.collect(&:email).reject(&:blank?)
16.550000 1.640000 18.190000 (128.002465)
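For reference, the four columns are user, system, total, and real (wall-clock) seconds in the format Ruby's Benchmark module prints; a sketch of how timings like these are typically captured (assuming Benchmark.measure was used):
require 'benchmark'
# Prints user, system, total CPU seconds, then real seconds in parentheses,
# matching the columns shown above.
puts Benchmark.measure { @emails = @persons.collect(&:email).reject(&:blank?) }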
ids contains almost 10,000 ids, and when running the last query I'm seeing:
SELECT "persons".* FROM "persons" WHERE (id in (121,142,173,178...14202))
(repeated 1000s of times ->) User Load (13.0ms) SELECT "users".* FROM "users" WHERE "users"."roleable_type" = 'Person' AND "users"."roleable_id" = 121 LIMIT 1
Indexes on User:
add_index "users", ["roleable_id", "roleable_type"], :name => "index_users_on_roleable_id_and_roleable_type"
add_index "users", ["roleable_type", "roleable_id"], :name => "index_users_on_roleable_type_and_roleable_id"
How can I fix what's happening here?
Upvotes: 2
Views: 165
Reputation: 35533
The 2nd query as you have it isn't actually hitting the database. It's building up an ActiveRecord::Relation (a lazy query) that isn't triggered until the 3rd query is called. You can demonstrate this by adding .all to the end of the 2nd query.
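A minimal illustration of that laziness (the mechanics are standard ActiveRecord; .all forces the load in Rails 3):
@persons = Person.where("id IN (?)", ids)  # builds a relation, no SQL runs yet
@persons.all                               # the SELECT is executed here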
To fix the performance issue, you want to get rid of the IN () with a literal list, as this can really hurt database performance for large lists:
@persons = Person.joins(:projects).merge(@Company.projects.submitted)
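That merge relies on Person being joined to Project through the person_id column seen in the original query; the association names below are inferred, so treat this as a sketch and adjust to your actual models:
class Person < ActiveRecord::Base
  has_many :projects              # assumes projects.person_id exists
end

class Project < ActiveRecord::Base
  belongs_to :person
end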
You can also do this using a subquery (albeit less efficiently than a JOIN):
subquery = @Company.projects.submitted.select("projects.person_id").to_sql
@persons = Person.where("id IN (#{subquery})")
If you only want to get the resulting @emails, and don't really need the @persons collection, you can make this slightly more efficient like so:
@emails = Person.joins(:projects).merge(@Company.projects.submitted).
  where("LENGTH(persons.email) > 0").pluck(:email)
Upvotes: 1