shahjapan

Reputation: 14335

How to process a huge CSV file in Python?

I'm trying to process a CSV file with ~73 billion rows.
I'm storing the processed rows in a Python collections.defaultdict with strings as keys and tuples as values, but storing 50K rows into this data structure takes ~100 seconds.

I'm processing the CSV file in chunks of 50K rows to make sure the system doesn't run out of memory and start swapping to disk.
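
For reference, here is roughly how I read the file in chunks (the file name and chunk size are illustrative):

    import csv
    from itertools import islice

    def read_in_chunks(path, chunk_size=50000):
        """Yield lists of up to chunk_size parsed CSV rows at a time."""
        with open(path, newline='') as f:
            reader = csv.reader(f)
            while True:
                chunk = list(islice(reader, chunk_size))
                if not chunk:
                    break
                yield chunk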

Later on I load the processed CSV files into a database table and do a FULL OUTER JOIN to obtain the combined result.

Example CSV row (ID, value):

"10203","http://google.com/goo.gl?key='universe'&value='somedata'"

Data Structure:

dt = {'goog': [(10203, 1), ...], ...}

Basically I'm trying to implement a full-text search feature - for that I need to maintain the position of every 4-character slice of the value, together with its associated ID.
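
For illustration, a minimal sketch of how one chunk gets indexed (I'm treating the second element of each tuple as the position of the 4-character slice within the value):

    from collections import defaultdict

    def index_chunk(rows, index=None):
        """Map every 4-character slice of the value to (ID, position) tuples."""
        if index is None:
            index = defaultdict(list)
        for row_id, value in rows:
            for pos in range(len(value) - 3):
                index[value[pos:pos + 4]].append((int(row_id), pos))
        return index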

Upvotes: 3

Views: 1228

Answers (2)

shahjapan

Reputation: 14335

Finally I have found a solution that suits my current requirement best.

Previously this task took around 20-24 hours to run; now it takes around half an hour.

The programming model I was looking for was the MapReduce programming model, which was easy to use and easy to code for my requirement.

It's really fast and efficient: I'm using the gpmapreduce utility with Python for this.

Thanks to @Srikar-Appal - this is almost the same as his 4th suggestion, which inspired me to use the MapReduce model.
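
To give an idea of the shape, here are the map and reduce steps in plain Python; this only illustrates the model, it is not the actual gpmapreduce spec or its function signatures:

    from collections import defaultdict

    def map_row(row_id, value):
        """Emit (4-character slice, (ID, position)) pairs for one CSV row."""
        for pos in range(len(value) - 3):
            yield value[pos:pos + 4], (int(row_id), pos)

    def reduce_pairs(pairs):
        """Group the emitted pairs by slice into the final index."""
        index = defaultdict(list)
        for key, id_pos in pairs:
            index[key].append(id_pos)
        return index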

Upvotes: 0

Srikar Appalaraju

Reputation: 73588

Here are some things that come to mind -

  1. As @Burhad suggests, why can't you load this file directly into the DB? Any kind of string processing like you are doing can be done in a regular RDBMS like MySQL; they have string functions, you know. A simple DB routine could do all of this within the DB itself, without even writing the data to a file in the first place.
  2. If you don't want to take the above approach, I suggest you try this: split the file into, let's say, n smaller files, and start a master process which forks n sub-processes to process these n chunk files in parallel (a rough sketch is included at the end of this answer). That way, in 100 seconds you would theoretically get n * 50,000 rows processed. Note that I say "theoretically" since, if all of this is happening on a single hard disk, the hard disk might not transmit data concurrently, so there might be a delay in satisfying concurrent requests (but then again, the I/O algorithms on modern operating systems cache a portion of the file being read, which might get you close to the figures mentioned above).
  3. An extension of the above approach is to use multiple hard disks, all part of the same RAID array on the same local machine.
  4. If you require even more throughput, think distributed computing: say x machines, each with y hard disks. Make x*y chunks of your file, put them on these machines, and run your processing program. Then you process x*y*50,000 rows in the same 100 seconds. The throughput increases with the number of machines and hard disks employed. You might have to deal with some new problems of distributed computing (availability, fault tolerance, etc.), but still...

Points 2, 3 and 4 are predicated on the assumption that each row in your CSV file can be processed independently and that there is no dependency amongst rows.
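
For point 2, a rough sketch using the multiprocessing module, assuming the file has already been split and that process_line stands in for your real per-row logic:

    import multiprocessing as mp

    def process_line(line):
        """Placeholder per-row logic; replace with your real processing."""
        row_id, value = line.rstrip('\n').split(',', 1)
        return row_id.strip('"'), len(value)

    def process_chunk(path):
        """Worker: process one pre-split chunk file and return its partial results."""
        with open(path) as f:
            return [process_line(line) for line in f]

    if __name__ == '__main__':
        # chunk_0.csv ... chunk_7.csv are pre-split pieces of the big CSV (names are illustrative)
        chunk_files = ['chunk_%d.csv' % i for i in range(8)]
        with mp.Pool(processes=len(chunk_files)) as pool:
            partial_results = pool.map(process_chunk, chunk_files)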

Upvotes: 5
