Reputation: 98
I have a 14.7 GB txt file with 4 columns and ~440 million rows. I have to load it into my database, but I want to make sure I don't end up with duplicates.
Since the columns are zipcode/city/latitude/longitude, I figured I only need to select distinct latitude and longitude, because there can't be two places with the same lat AND long.
I want to know if there is a simple way to do this duplicate check before I add all my data to the DB.
The file is way too big to do anything with directly, so I split it into 700 MB chunks and used:
LOAD DATA LOCAL INFILE '...\text6.txt' IGNORE
INTO TABLE zipCodes
FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
(zipCode, city, latitude, longitude);
to add each file into the DB.
Now the problem is that, after loading about 80M rows, any SELECT I run against the local DB to search for duplicates takes far too long.
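For reference, the kind of duplicate search I mean is roughly this (illustration only, using the columns from the LOAD DATA statement above):
-- Find (latitude, longitude) pairs that occur more than once;
-- on ~80M rows with no index on these columns this is far too slow.
SELECT latitude, longitude, COUNT(*) AS cnt
FROM zipCodes
GROUP BY latitude, longitude
HAVING cnt > 1;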
How should I do it? I'm new to this kind of work and need to learn as much as possible, as fast as possible.
Upvotes: 0
Views: 50
Reputation: 96159
I'd remove/disable the unique index before the LOAD DATA import so that all the data imports quickly.
After the raw import is done, create/activate a new unique index via
ALTER IGNORE TABLE zipCodes ADD UNIQUE (latitude,longitude)
(a MySQL-specific extension).
It will (silently) drop duplicate rows from the table, and it is usually faster than constantly updating the index on every insert while the index is active.
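A minimal sketch of the whole sequence, assuming the table/column names from your LOAD DATA statement (the index name lat_long is just a placeholder; check SHOW INDEX FROM zipCodes for the real one):
-- 1) Drop the existing unique index, if there is one, so the bulk load isn't slowed down:
ALTER TABLE zipCodes DROP INDEX lat_long;

-- 2) Bulk-load each chunk with no unique index in the way:
LOAD DATA LOCAL INFILE '...\text6.txt' IGNORE
INTO TABLE zipCodes
FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
(zipCode, city, latitude, longitude);

-- 3) Re-create the unique index; IGNORE silently discards rows with duplicate (latitude, longitude):
ALTER IGNORE TABLE zipCodes ADD UNIQUE lat_long (latitude, longitude);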
Upvotes: 3