Nicu-Cristian Vlad

Reputation: 98

Search for duplicate records: in the database using SQL, or in the .txt file using PHP?

I have a 14.7 GB .txt file with 4 columns and ~440 million rows. I have to add them to my database, but I would like to make sure that I don't end up with duplicates.

Since I have zipcode/city/latitude/longitude columns, I figured I only need to select "distinct latitude and longitude", because there can't be two places with the same lat AND long.

I want to know if there is a simple way to do this duplicate search before I add all my data to the DB.
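
For reference, this is roughly the duplicate check I have in mind, as a sketch against the zipCodes table and columns from the LOAD DATA statement further down:

-- Coordinate pairs that occur more than once in the table.
SELECT latitude, longitude, COUNT(*) AS occurrences
FROM zipCodes
GROUP BY latitude, longitude
HAVING COUNT(*) > 1;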

The file is way too big to do anything with directly, so I split it into 700 MB files and used:

LOAD DATA LOCAL INFILE '...\text6.txt' IGNORE
INTO TABLE zipCodes
FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
(zipCode, city, latitude, longitude);

to add each file into the DB.

Now the problem is that after adding about 80M lines, running that SELECT on the local DB to search for duplicates takes far too much time.

How should I do it? I'm new to this kind of work and need to learn as much as possible, as fast as possible.

Upvotes: 0

Views: 50

Answers (1)

VolkerK

Reputation: 96159

I'd remove/disable the unique index before the LOAD DATA import, so all the data can be loaded quickly.
After the raw import is done, create/activate a new unique index via
ALTER IGNORE TABLE zipCodes ADD UNIQUE (latitude, longitude) (a MySQL-specific extension).
It will (silently) drop duplicates from the table, and that is usually faster than constantly updating the index while inserting data with the index active.
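
As a rough sketch, the whole sequence could look like this (the index name uniq_lat_lng is just an example, and the LOAD DATA statement is the one from the question):

-- 1. Drop the existing unique index, if there is one (check SHOW INDEX FROM zipCodes for its name).
ALTER TABLE zipCodes DROP INDEX uniq_lat_lng;

-- 2. Bulk-load each split file without the overhead of index maintenance.
LOAD DATA LOCAL INFILE '...\text6.txt' IGNORE
INTO TABLE zipCodes
FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
(zipCode, city, latitude, longitude);

-- 3. Recreate the unique index; IGNORE silently discards rows with duplicate (latitude, longitude) pairs.
ALTER IGNORE TABLE zipCodes ADD UNIQUE INDEX uniq_lat_lng (latitude, longitude);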

Upvotes: 3
