Reputation: 544
I'm trying to deal with multiple database entries. We've been scrapping the data from a site and someone forgot to write a check against current data that had already been created. We've got around 12,000 entries now.
I'm looking to go through each of the entries and delete the ones that are duplicate. This is what I have so far:
for Listing.each(:order => "street_address DESC") do |listing|
if listing.scraped == true
if previous_street_address && listing.street_address == previous_street_address
listing.delete
end
end
previous_street_address = listing.street_address
end
This is the database schema:
create_table "listings", :force => true do |t|
t.integer "category"
t.string "street_address"
t.string "city"
t.string "state"
t.datetime "availability"
t.integer "bedrooms"
t.integer "bathrooms"
t.integer "square_footage"
t.string "short_description"
t.text "long_description"
t.integer "price"
t.integer "period"
t.datetime "created_at", :null => false
t.datetime "updated_at", :null => false
t.integer "landlord_id"
t.float "latitude"
t.float "longitude"
t.datetime "expires_at"
t.boolean "warned_of_expiration"
t.integer "views", :default => 0, :null => false
t.boolean "fake"
t.string "short_web_url"
t.string "apartment_no"
t.string "zipcode"
t.datetime "delisted_at"
t.integer "daily_rate"
t.integer "photos_count"
t.string "delisted_reason"
t.boolean "scraped"
t.boolean "cats", :default => false
t.boolean "dogs", :default => false
end
I know I need to sort the data in a meaningful way but I don't think Street Address works as if the address is "123 Anywhere Street" and I Listing.street_address.to_i it will convert it to 123 and then sort by those numbers which could cause issues. I'm looking for some fresh eyes on this problem.
Upvotes: 0
Views: 143
Reputation: 63
You may find it easier to do this directly in Postgres. This thread has several examples.
Upvotes: 0
Reputation: 15779
12,000 isn't that much to fit in memory. How about this solution:
all = Listing.all
to_remove = all-all.uniq{|el| [el.street_address, ..., ...] }
to_remove.each(&:delete)
Upvotes: 5