Reputation: 14335
I'm trying to process a CSV file having ~73 billion rows.
I'm storing the processed rows in a Python collections.defaultdict with strings as keys and lists of tuples as values; however, storing 50K rows into this data structure takes ~100 seconds.
I'm processing the CSV file in chunks of 50K rows to make sure the system doesn't run out of memory and to avoid swap I/O from disk spills.
Later on I load those processed CSV files into a table and do a FULL OUTER JOIN to obtain the combined result.
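For illustration, here is a toy sqlite3 sketch of that combining step (the table and column names are made up; FULL OUTER JOIN needs SQLite 3.39+, and older versions can emulate it with two LEFT JOINs and a UNION):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    # Two toy "processed chunk" tables standing in for the real ones.
    conn.executescript("""
        CREATE TABLE chunk_a (key TEXT, id INTEGER, pos INTEGER);
        CREATE TABLE chunk_b (key TEXT, id INTEGER, pos INTEGER);
        INSERT INTO chunk_a VALUES ('goog', 10203, 1);
        INSERT INTO chunk_b VALUES ('oogl', 10203, 2);
    """)

    # Keys present in only one chunk survive the FULL OUTER JOIN,
    # with NULLs on the other side.
    rows = conn.execute("""
        SELECT COALESCE(a.key, b.key), a.id, a.pos, b.id, b.pos
        FROM chunk_a AS a
        FULL OUTER JOIN chunk_b AS b ON a.key = b.key
    """).fetchall()
    print(rows)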
Example CSV row (ID, value):
"10203","http://google.com/goo.gl?key='universe'&value='somedata'"
Data Structure:
dt = {'goog': [(10203, 1), ...], ...}
Basically I'm trying to implement an algorithm for a full-text-search feature - for that I need to maintain the positions of each value, in parts of 4 characters, together with its associated ID.
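Roughly, the per-chunk processing looks like this sketch (simplified; it assumes overlapping 4-character parts with character positions, and 'data.csv' is a placeholder path):

    import csv
    from collections import defaultdict

    def index_chunk(rows):
        """Build the 4-gram index for one chunk of (id, value) rows."""
        dt = defaultdict(list)
        for row_id, value in rows:
            # Take the value in parts of 4 characters, recording each
            # part's position together with the row ID.
            for pos in range(len(value) - 3):
                dt[value[pos:pos + 4]].append((int(row_id), pos))
        return dt

    def chunks(path, size=50_000):
        """Yield lists of (id, value) rows, size rows at a time."""
        with open(path, newline='') as f:
            batch = []
            for row in csv.reader(f):
                batch.append(row)
                if len(batch) == size:
                    yield batch
                    batch = []
            if batch:
                yield batch

    for batch in chunks('data.csv'):  # placeholder input path
        dt = index_chunk(batch)
        # ... dump dt to a processed CSV for the later FULL OUTER JOIN ...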
Upvotes: 3
Views: 1228
Reputation: 14335
Finally I have found the perfect solution, best suited to my current requirement.
Previously this task ran for around 20-24 hours; now it takes around half an hour.
The programming model I was looking for was the MapReduce programming model, which was easy to use and easy to code for my requirement.
It's really fast and efficient: I'm using the gpmapreduce utility with the Python programming language for this.
Thanks to @Srikar-Appal - it's almost similar to his 4th solution, which inspired me to use the MapReduce model.
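In plain Python the model looks roughly like this (a toy sketch of the map/reduce decomposition, not my actual gpmapreduce job):

    from collections import defaultdict
    from multiprocessing import Pool

    def map_rows(rows):
        """Map step: emit (4-gram, (id, position)) pairs for one chunk."""
        pairs = []
        for row_id, value in rows:
            for pos in range(len(value) - 3):
                pairs.append((value[pos:pos + 4], (int(row_id), pos)))
        return pairs

    def reduce_pairs(pair_lists):
        """Reduce step: merge mapped pairs, grouping postings by 4-gram."""
        dt = defaultdict(list)
        for pairs in pair_lists:
            for key, posting in pairs:
                dt[key].append(posting)
        return dt

    if __name__ == '__main__':
        # Two toy chunks standing in for the real 50K-row chunks.
        chunks = [
            [('10203', "http://google.com/goo.gl?key='universe'")],
            [('10204', "http://google.com/goo.gl?value='somedata'")],
        ]
        with Pool() as pool:  # the map step runs in parallel across chunks
            dt = reduce_pairs(pool.map(map_rows, chunks))
        print(dt['goog'])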
Upvotes: 0
Reputation: 73588
Here are some things that come to mind -
- Split your file into n smaller files. Start a master process which forks n sub-processes to process these n chunk files in parallel; that way in 100 seconds you would theoretically get n * 50,000 rows processed (a minimal sketch follows this list). Note that I am saying "theoretically" since if all this is happening on a single harddisk, the harddisk might not transmit data concurrently, so there might be delays in satisfying concurrent requests (but then again, the I/O algorithms that run on modern operating systems cache a portion of the file being read, which might get you close to the above-mentioned figures).

- Take x machines, each with y harddisks. Make x*y chunks of your file, put them on these machines, and run your processing program. That way you process x*y*50,000 rows in the same 100 seconds; the throughput increases with the number of machines and harddisks employed. You might have to deal with some newer problems of distributed computing (like availability, fault tolerance etc.), but still...

Both suggestions are predicated on the assumption that each row in your CSV file can be processed independently and that there is no dependency amongst rows.
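Here is a minimal sketch of the first suggestion using the stdlib multiprocessing module (the chunk file names and the per-row work are placeholders):

    import csv
    from multiprocessing import Pool

    def process_file(path):
        """Runs in a worker process: handle one chunk file independently."""
        count = 0
        with open(path, newline='') as f:
            for row in csv.reader(f):
                # ... per-row work goes here (e.g. the 4-gram indexing) ...
                count += 1
        return path, count

    if __name__ == '__main__':
        # Chunk files made beforehand, e.g. with `split -l 50000 big.csv chunk_`.
        chunk_files = ['chunk_aa', 'chunk_ab', 'chunk_ac']
        with Pool(processes=len(chunk_files)) as pool:
            for path, count in pool.map(process_file, chunk_files):
                print(path, 'processed', count, 'rows')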
Upvotes: 5