Reputation: 98
I have a 14.7 GB txt file with 4 columns and ~440 million rows. I have to load it into my database, but I want to make sure I don't end up with duplicates.
Since the columns are zipcode/city/latitude/longitude, I figured I only need to select distinct latitude and longitude, because there can't be two places with the same lat AND long.
I want to know if there is a simple way to do this duplicate check before I add all my data to the DB.
The file is way too big to do anything with directly, so I split it into 700 MB chunks and used:
LOAD DATA LOCAL INFILE '...\text6.txt' IGNORE
INTO TABLE zipCodes
FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
(zipCode, city, latitude, longitude);
to add each file into the DB.
Now the problem is that, after loading about 80M rows, any SELECT I run against the local DB to search for duplicates takes far too long.
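For reference, the kind of duplicate search I mean is roughly this (illustration only, using the columns from the LOAD DATA statement above):
-- Find (latitude, longitude) pairs that occur more than once;
-- on ~80M rows with no index on these columns this is far too slow.
SELECT latitude, longitude, COUNT(*) AS cnt
FROM zipCodes
GROUP BY latitude, longitude
HAVING cnt > 1;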
How should I do it? I'm new to this kind of work and need to learn as much as possible, as fast as possible.
Upvotes: 0
Views: 50
Reputation: 96159
I'd remove/disable the unique index before the LOAD DATA import so that all the data imports quickly.
After the raw import is done, create/activate a new unique index via
ALTER IGNORE TABLE zipCodes ADD UNIQUE (latitude,longitude)
(a MySQL-specific extension).
It will (silently) drop duplicate rows from the table, and it is usually faster than constantly updating the index on every insert while the index is active.
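A minimal sketch of the whole sequence, assuming the table/column names from your LOAD DATA statement (the index name lat_long is just a placeholder; check SHOW INDEX FROM zipCodes for the real one):
-- 1) Drop the existing unique index, if there is one, so the bulk load isn't slowed down:
ALTER TABLE zipCodes DROP INDEX lat_long;

-- 2) Bulk-load each chunk with no unique index in the way:
LOAD DATA LOCAL INFILE '...\text6.txt' IGNORE
INTO TABLE zipCodes
FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
(zipCode, city, latitude, longitude);

-- 3) Re-create the unique index; IGNORE silently discards rows with duplicate (latitude, longitude):
ALTER IGNORE TABLE zipCodes ADD UNIQUE lat_long (latitude, longitude);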
Upvotes: 3