max pleaner
max pleaner

Reputation: 26788

ActiveRecord query searching for duplicates on a column, but returning associated records

So here's the lay of the land:

I have a Applicant model which has_many Lead records.

I need to group leads by applicant email, i.e. for each specific applicant email (there may be 2+ applicant records with the email) i need to get a combined list of leads.

I already have this working using an in-memory / N+1 solution

I want to do this in a single query, if possible. Right now I'm running one for each lead which is maxing out the CPU.

Here's my attempt right now:

Lead.
all.
select("leads.*, applicants.*").
joins(:applicant).
group("applicants.email").
having("count(*) > 1").
limit(1).
to_a

And the error:

  Lead Load (1.2ms)  SELECT  leads.*, applicants.* FROM "leads" INNER 
  JOIN "applicants" ON "applicants"."id" = "leads"."applicant_id" 
  GROUP BY applicants.email HAVING count(*) > 1 LIMIT 1
  ActiveRecord::StatementInvalid: PG::GroupingError: ERROR:  column
  "leads.id" must appear in the GROUP BY clause or be used in an 
  aggregate function
  LINE 1: SELECT  leads.*, applicants.* FROM "leads" INNER JOIN 
  "appli...

Upvotes: 0

Views: 119

Answers (2)

Edmund Lee
Edmund Lee

Reputation: 2584

This is a postgres specific issue. "the selected fields must appear in the GROUP BY clause".

must appear in the GROUP BY clause or be used in an aggregate function

You can try this

Lead.joins(:applicant)
    .select('leads.*, applicants.email')
    .group_by('applicants.email, leads.id, ...')

You will need to list all the fields in leads table in the group by clause (or all the fields that you are selecting).

I would just get all the records and do the grouping in memory. If you have a lot of records, I would paginate them or batch them.

group_by_email = Hash.new { |h, k| h[k] = [] }

Applicant.eager_load(:leads).each_batch(10_000) do |batch|
  batch.each do |applicant|
    group_by_email[:applicant.email] << applicant.leads
  end
end

Upvotes: 1

ruby_newbie
ruby_newbie

Reputation: 3285

You need to use a .where rather than using Lead.all. The reason it is maxing out the CPU is you are trying to load every lead into memory at once. That said I guess I am still missing what you actually want back from the query so it would be tough for me to help you write the query. Can you give more info about your associations and the expected result of the query?

Upvotes: 0

Related Questions