Reputation: 427
I have a local PostgreSQL database that was created using a Rails application. It has 600k records, of which ~200k are duplicates. I want to keep only one of each duplicated record and delete the rest. I write SQL every day for work, but Rails is my hobby, and I still struggle with ActiveRecord.
Here's how I found the duplicates (in Rails console):
Summary.select(:map_id).group(:map_id).having("count(*) > 1")
I don't think I can simply append destroy_all
to the end of that statement, as it would destroy every instance of each duplicated map_id, including the one copy I want to keep.
Could you please tell me how to update this so that it removes the duplicates?
Upvotes: 3
Views: 1605
Reputation: 6076
I would go to the db console (rails dbconsole) and do:
CREATE TABLE some_temp_name AS SELECT DISTINCT ON (map_id) * FROM summaries ORDER BY map_id, id;
Then rename the tables (ALTER TABLE summaries RENAME TO summaries_old; ALTER TABLE some_temp_name RENAME TO summaries;). The ORDER BY makes the kept row deterministic: the lowest id per map_id.
EDIT - This seems like what you're looking for:
Summary.where.not(id: Summary.group(:map_id).pluck('min(summaries.id)')).delete_all
NOT TESTED. It was part of this answer here: Rails: Delete duplicate records based on multiple columns
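The linked query keeps the row with the lowest id per map_id and deletes everything else. The same logic can be sketched in plain Ruby, with an in-memory array of hashes standing in for the summaries table (sample data only, not the real schema):

```ruby
# Rows standing in for the summaries table.
records = [
  { id: 1, map_id: 235 }, { id: 2, map_id: 299 }, { id: 3, map_id: 324 },
  { id: 4, map_id: 235 }, { id: 5, map_id: 235 }, { id: 6, map_id: 299 },
]

# Ids to keep: the minimum id within each map_id group
# (the equivalent of pluck('min(summaries.id)') above).
keep_ids = records.group_by { |r| r[:map_id] }
                  .values
                  .map { |group| group.map { |r| r[:id] }.min }

# Everything NOT in keep_ids is what where.not(...).delete_all would remove.
survivors = records.select { |r| keep_ids.include?(r[:id]) }
survivors.map { |r| r[:id] }  # => [1, 2, 3]
```

Note that the real query does this in a single round trip plus one DELETE, which matters at 600k rows.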
Upvotes: 3
Reputation: 8042
This will destroy the duplicates in waves, selecting only a single duplicate per map_id on each pass. The loop will automatically finish when no more duplicates exist.
loop do
  duplicates = Summary.select("MAX(id) as id, map_id").group(:map_id).having("count(*) > 1")
  break if duplicates.length == 0
  duplicates.destroy_all
end
If the database looks like this:
| id | map_id |
| 1 | 235 |
| 2 | 299 |
| 3 | 324 |
| 4 | 235 |
| 5 | 235 |
| 6 | 299 |
| 7 | 235 |
| 8 | 324 |
| 9 | 299 |
In the first wave, these records would be returned and destroyed:
| id | map_id |
| 7 | 235 |
| 8 | 324 |
| 9 | 299 |
In the second wave, these records would be returned and destroyed:
| id | map_id |
| 5 | 235 |
| 6 | 299 |
The third wave would return and destroy this record:
| id | map_id |
| 4 | 235 |
The fourth wave would complete the process. Unless there are numerous duplicates for a given map_id, it's likely that this process will finish in single-digit loop iterations.
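As a sanity check, the waves above can be simulated in plain Ruby (no database) on the nine sample rows:

```ruby
# Build the sample table: ids 1..9 with the map_ids from the table above.
rows = [235, 299, 324, 235, 235, 299, 235, 324, 299].each_with_index.map do |map_id, i|
  { id: i + 1, map_id: map_id }
end

waves = 0
loop do
  # One "duplicate" per map_id: the highest id in each group of 2+ rows
  # (mirrors the MAX(id) ... HAVING count(*) > 1 query).
  dupes = rows.group_by { |r| r[:map_id] }
              .values
              .select { |g| g.size > 1 }
              .map { |g| g.max_by { |r| r[:id] } }
  break if dupes.empty?
  rows -= dupes
  waves += 1
end

waves                    # => 3 destructive passes; the 4th pass finds nothing
rows.map { |r| r[:id] }  # => [1, 2, 3], the oldest row per map_id
```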
Given the approach, only duplicates will ever be returned, and only the newer duplicates will be removed. To remove the older duplicates instead, change the query to this:
duplicates = Summary.select("MIN(id) as id, map_id").group(:map_id).having("count(*) > 1")
In that case, wave 1 would return and destroy:
| id | map_id |
| 1 | 235 |
| 2 | 299 |
| 3 | 324 |
Wave 2 would return and destroy:
| id | map_id |
| 4 | 235 |
| 6 | 299 |
Wave 3 would return and destroy:
| id | map_id |
| 5 | 235 |
Wave 4 would complete the process.
Upvotes: 4
Reputation: 176362
What I would suggest is to fetch all the records ordered by the field that has duplicates, then loop over them, keeping only the first record per value.
value = nil
Summary.order("map_id ASC").each do |record|
  if record.map_id == value
    # duplicate
    record.destroy
  else
    # first entry
    value = record.map_id
  end
end
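The same first-one-wins pass can be sketched in plain Ruby over a sorted in-memory sample (hypothetical data, no database), which makes the keep/drop split explicit:

```ruby
# Sample rows standing in for Summary records.
rows = [
  { id: 1, map_id: 235 }, { id: 4, map_id: 235 },
  { id: 2, map_id: 299 }, { id: 6, map_id: 299 },
  { id: 3, map_id: 324 },
]

value = nil
kept, dropped = [], []
# Sort by map_id (ties broken by id) so duplicates are adjacent,
# then keep the first row seen for each map_id.
rows.sort_by { |r| [r[:map_id], r[:id]] }.each do |record|
  if record[:map_id] == value
    dropped << record  # duplicate: this is where record.destroy would run
  else
    kept << record     # first entry for this map_id
    value = record[:map_id]
  end
end

kept.map { |r| r[:id] }     # => [1, 2, 3]
dropped.map { |r| r[:id] }  # => [4, 6]
```

One caveat with the ActiveRecord version: `.each` on the relation loads every record into memory at once, which is workable at 600k rows but worth knowing about.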
Upvotes: 1