nonegiven72

Reputation: 427

Removing duplicates from Rails postgres database

I have a local PostgreSQL database that was created using a Rails application. It has 600k records, of which ~200k are duplicates. I want to keep only one of each duplicated record and delete the rest. I write SQL every day for work, but Rails is my hobby and I still struggle with ActiveRecord.

Here's how I found the duplicates (in Rails console):

Summary.select(:map_id).group(:map_id).having("count(*) > 1")

I don't think I can simply append destroy_all to that statement, as it would destroy every record with a duplicated map_id, including the one copy I want to keep.

Could you please tell me how to update this so that it removes the duplicates?
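
For context, I think something like this should give the number of surplus rows (please correct me if the ActiveRecord is off):

Summary.count - Summary.distinct.count(:map_id)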

Upvotes: 3

Views: 1605

Answers (3)

seph

Reputation: 6076

I would go to the db console (rails dbconsole) and do:

CREATE TABLE some_temp_name AS SELECT DISTINCT ON (map_id) * FROM summaries;

Then rename the tables so that some_temp_name takes the place of summaries.

EDIT - This seems like what you're looking for:

Summary.where.not(id: Summary.group(:map_id).pluck('min(summaries.id)')).delete_all

NOT TESTED. It was part of this answer here: Rails: Delete duplicate records based on multiple columns
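
If you want to sanity-check it before deleting anything, something along these lines (also untested) should report how many rows would be removed:

ids_to_keep = Summary.group(:map_id).pluck('min(summaries.id)')
Summary.where.not(id: ids_to_keep).count # rows that delete_all would remove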

Upvotes: 3

Michael Gaskill

Reputation: 8042

This will destroy the duplicates in waves, selecting only a single duplicate per map_id, on each pass. The loop will automatically finish when no more duplicates exist.

loop do
  # One row per map_id that still has duplicates: the newest (highest id) copy
  duplicates = Summary.select("MAX(id) as id, map_id").group(:map_id).having("count(*) > 1")
  break if duplicates.length == 0
  duplicates.destroy_all
end

If the database looks like this:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |
|  4 |    235 |
|  5 |    235 |
|  6 |    299 |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the first wave, these records would be returned and destroyed:

| id | map_id |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the second wave, these records would be returned and destroyed:

| id | map_id |
|  5 |    235 |
|  6 |    299 |

The third wave would return and destroy this record:

| id | map_id |
|  4 |    235 |

The fourth wave would complete the process. Unless there are numerous duplicates for a given map_id, it's likely that this process will finish in single-digit loop iterations.

With this approach, only duplicates are ever returned, and only the newer duplicates are removed. To remove the older duplicates instead, change the query to this:

  duplicates = Summary.select("MIN(id) as id, map_id").group(:map_id).having("count(*) > 1")

In that case, wave 1 would return and destroy:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |

Wave 2 would return and destroy:

| id | map_id |
|  4 |    235 |
|  6 |    299 |

Wave 3 would return and destroy:

| id | map_id |
|  5 |    235 |

Wave 4 would complete the process.
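
Once the loop exits, re-running the duplicate check is a quick way to confirm nothing is left (assuming the same Summary model); it should return an empty hash:

Summary.group(:map_id).having("count(*) > 1").count # => {}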

Upvotes: 4

Simone Carletti

Reputation: 176362

What I would suggest is to fetch all the records and order them by the field that has duplicates.

Then loop over all the records and keep just one record per value.

value = nil
Summary.order("map_id ASC").each do |record|
  if record.map_id == value
    # duplicate
    record.destroy
  else
    # first entry
    value = record.map_id
  end
end
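
With 600k rows, loading full records may be heavy. A variation that selects only the two columns the loop needs should reduce memory, and ordering by id as well makes it deterministic that the oldest copy survives (untested sketch; it assumes no callbacks or locking columns need the other attributes):

value = nil
Summary.select(:id, :map_id).order("map_id ASC, id ASC").each do |record|
  if record.map_id == value
    record.destroy # duplicate: only the id is needed for the delete
  else
    value = record.map_id # keep the first (lowest id) row for this map_id
  end
end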

Upvotes: 1
