Tucker

Reputation: 7362

Importing 10 billion rows into MySQL

I have a .csv file with 10 billion rows. I want to check that each row is unique. Is there an easy way to do this? I was thinking that importing into MySQL might let me check for uniqueness quickly. How can I upload this huge file to MySQL? I have already tried row-by-row INSERT statements and also the 'LOAD DATA INFILE' command, but both failed.

Thanks

Upvotes: 2

Views: 3506

Answers (6)

Mike T

Reputation: 43612

I wouldn't use a database for this purpose, unless it needed to end up in the database eventually. Assuming each row has the same formatting (so you don't have "8.230" and "8.23" representing the same value, or extra spaces at the start or end of otherwise-equal lines), use a few of the text utilities included with most POSIX environments (Linux, Mac OS X), or available for Windows via the GnuWin32 coreutils.

Here is the sequence of steps to run from your system shell. First, sort the file (this step is required):

sort ten.csv > ten_sorted.csv

Then find unique rows from sorted data:

uniq ten_sorted.csv > ten_uniq.csv

Now you can check to see how many rows there are in the final file:

wc ten_uniq.csv

Or you can use pipes to combine the three steps into one command line:

sort ten.csv | uniq | wc
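
If you only care whether any duplicates exist at all, a small variation on the same idea (a sketch reusing the ten.csv name from above; the sort, uniq and wc flags shown are standard) is to compare the total line count with the unique line count, or to print the duplicated rows directly:

# total rows vs. unique rows; if the two numbers match, every row is unique
wc -l < ten.csv
sort ten.csv | uniq | wc -l

# or write out only the rows that occur more than once
sort ten.csv | uniq -d > duplicates.csv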

Upvotes: 3

mikkokotila

Reputation: 1431

This is the kind of job Linux is "made for".

First you have to split the file into many smaller files:

split -l 100 filename

After this you have a few options with the two commands sort and uniq. I timed 8 different approaches with a file of 1 million IP addresses from an ad-exchange log file, and found an almost 20x difference between using LC_ALL=C and not using it. For example:

LC_ALL=C sort IP_1m_rows.txt > temp_file
LC_ALL=C uniq temp_file > IP_unique_rows.txt

real    0m1.283s
user    0m1.121s
sys     0m0.088s

Whereas the same without LC_ALL=C:

sort IP_1m_rows.txt > temp_file
uniq temp_file > IP_unique_rows.txt

real    0m24.596s
user    0m24.065s
sys     0m0.201s

Piping the two commands together while still using LC_ALL=C was 2x slower than the fastest option:

LC_ALL=C sort IP_1m_rows.txt | uniq > IP_unique_rows.txt

real    0m3.532s
user    0m3.677s
sys     0m0.106s

Databases are not useful for one-off jobs like this, and flat files will get you surprisingly far even with more challenging / long-term objectives.
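
If the file really is too big to handle in one pass, here is a sketch of how the split step and the sort/uniq step fit together (the filename and chunk size are just illustrative): sort each chunk on its own, then merge the already-sorted chunks with sort -m and deduplicate the merged stream. Note that GNU sort already spills to temporary files on its own, so splitting by hand mainly helps if you want to sort the chunks in parallel.

# split into manageable chunks, sort each chunk, then merge-sort the sorted chunks
split -l 10000000 huge.csv chunk_
for f in chunk_*; do LC_ALL=C sort "$f" > "$f.sorted"; done
LC_ALL=C sort -m chunk_*.sorted | uniq > unique_rows.csv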

Upvotes: 0

vpit3833

Reputation: 7951

If you are importing from Excel or a similar program, see here for how to cleanse the CSV file before importing it into MySQL. Regarding the unique rows, as long as your table schema is right, MySQL should be able to take care of it.

EDIT:

Whether the source is Excel or not, LOAD DATA LOCAL INFILE appears to be the way to go.

10 billion rows, and LOAD DATA LOCAL gives you an error? Are you sure there is no problem with the CSV file?
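
For reference, a minimal sketch of the LOAD DATA LOCAL INFILE route run through the mysql client (the user, database and table names, the filename, and the field/line terminators are assumptions to adjust for your schema; local_infile also has to be enabled on both client and server):

# load the CSV directly into an existing table
mysql --local-infile=1 -u youruser -p yourdb -e "
  LOAD DATA LOCAL INFILE 'ten.csv'
  INTO TABLE rows_import
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';"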

Upvotes: 1

Alan Savage

Reputation: 842

If you do have 10 billion rows then you will struggle to work with this data.

You would need to look at partitioning your database (ref here: about mysql partitioning)

However, even with partitioning, a row count that large would require some serious hardware to cut through the work involved.
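
For reference, a rough sketch of what a partitioned table definition looks like (the table layout, column names, and partition count are all assumptions; see the MySQL partitioning reference linked above for the real options):

mysql -u youruser -p yourdb <<'SQL'
-- hypothetical table, hash-partitioned on the id column into 64 pieces
CREATE TABLE big_import (
    id BIGINT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
)
PARTITION BY HASH(id)
PARTITIONS 64;
SQL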

Also, what would you do if a row was found to be non-unique? Would you want to continue importing the data? If you import the data, would you import the identical row or flag it as a duplicate? Would you stop processing?

Upvotes: 0

Strong Like Bull

Reputation: 11297

You have to break your import up into separate small bite-size chunks. Use BigDump.

http://www.ozerov.de/bigdump.php

Upvotes: 0

Nick Rolando

Reputation: 26157

Does the data have a unique identifier? Make that column the primary key in your MySQL table, and when you go to import the data, MySQL should throw an error if you have duplicates.

As for how to go about doing it: just read in the file row by row and do an insert on each row.
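
A minimal sketch of that idea, run through the mysql client (the user, database, table and column names are made up; the real schema would have to match the CSV). The PRIMARY KEY is what makes MySQL reject a duplicate at insert time:

mysql -u youruser -p yourdb <<'SQL'
-- hypothetical table: id is assumed to be the column that should be unique
CREATE TABLE rows_check (
    id BIGINT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id)
);
-- a later INSERT with the same id fails with a "Duplicate entry" error
INSERT INTO rows_check (id, payload) VALUES (1, 'first row');
SQL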

Upvotes: 1
