rigelstpierre
rigelstpierre

Reputation: 544

Dealing with Duplicate Database Entries

I'm trying to deal with multiple database entries. We've been scrapping the data from a site and someone forgot to write a check against current data that had already been created. We've got around 12,000 entries now.

I'm looking to go through each of the entries and delete the ones that are duplicate. This is what I have so far:

for Listing.each(:order => "street_address DESC") do |listing| 
  if listing.scraped == true
    if previous_street_address && listing.street_address == previous_street_address
      listing.delete
    end
  end
  previous_street_address = listing.street_address
end

This is the database schema:

  create_table "listings", :force => true do |t|
    t.integer  "category"
    t.string   "street_address"
    t.string   "city"
    t.string   "state"
    t.datetime "availability"
    t.integer  "bedrooms"
    t.integer  "bathrooms"
    t.integer  "square_footage"
    t.string   "short_description"
    t.text     "long_description"
    t.integer  "price"
    t.integer  "period"
    t.datetime "created_at",                              :null => false
    t.datetime "updated_at",                              :null => false
    t.integer  "landlord_id"
    t.float    "latitude"
    t.float    "longitude"
    t.datetime "expires_at"
    t.boolean  "warned_of_expiration"
    t.integer  "views",                :default => 0,     :null => false
    t.boolean  "fake"
    t.string   "short_web_url"
    t.string   "apartment_no"
    t.string   "zipcode"
    t.datetime "delisted_at"
    t.integer  "daily_rate"
    t.integer  "photos_count"
    t.string   "delisted_reason"
    t.boolean  "scraped"
    t.boolean  "cats",                 :default => false
    t.boolean  "dogs",                 :default => false
  end

I know I need to sort the data in a meaningful way but I don't think Street Address works as if the address is "123 Anywhere Street" and I Listing.street_address.to_i it will convert it to 123 and then sort by those numbers which could cause issues. I'm looking for some fresh eyes on this problem.

Upvotes: 0

Views: 143

Answers (2)

syncopated
syncopated

Reputation: 63

You may find it easier to do this directly in Postgres. This thread has several examples.

Upvotes: 0

jdoe
jdoe

Reputation: 15779

12,000 isn't that much to fit in memory. How about this solution:

all = Listing.all
to_remove = all-all.uniq{|el| [el.street_address, ..., ...] }
to_remove.each(&:delete)

Upvotes: 5

Related Questions