Reputation: 425
I have been working on a project where I need to read and process very large CSV files (millions of rows) as fast as possible.
I came across the link https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author benchmarks different ways of reading a CSV file and the time taken by each. He uses a cat-to-/dev/null baseline with the code shown:
import os

def catDevNull():
    os.system('cat %s > /dev/null' % fn)
The time taken in this case is the least. I believe it is independent of the Python version, since the time taken to read the file remains the same. He then uses the warm-cache wc method as shown:
def wc():
    os.system('wc -l %s > /dev/null' % fn)
The above two methods are the fastest. Using pandas.read_csv for the task takes less time than the other Python methods, but it is still slower than the two above.
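For reference, a rough timing sketch of the pandas case (my own illustration, not the linked benchmark's code; fn is assumed to hold the CSV path, as in the snippets above):

import time
import pandas as pd

start = time.time()
df = pd.read_csv(fn)  # fn is a placeholder for the CSV file name
print('pandas.read_csv took %.2f s' % (time.time() - start))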
Assigning x = os.system('cat %s > /dev/null' % fn) and checking the data type of x, I find it is a string.
How does os.system read the file so that the time taken is so much less? Also, is there a way to access the file after it is read by os.system, for further processing?
I am also curious why reading the file in pandas is so much faster than with the other methods shown in the link above.
Upvotes: 0
Views: 642
Reputation: 425
Based on my testing, I found that it is a lot faster to query a pandas DataFrame than to query a database (tested with sqlite3).
Thus, the fastest way is to load the CSV into a pandas DataFrame and then query the DataFrame as required. Also, if I need to save the data, I can pickle the DataFrame and reuse it later. The time to pickle and unpickle the DataFrame plus the query time is a lot less than storing the data in SQL and then querying for the results.
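A minimal sketch of this workflow (file and column names are placeholders for illustration):

import pandas as pd

df = pd.read_csv('large_file.csv')     # parse the CSV once
df.to_pickle('large_file.pkl')         # cache the parsed DataFrame on disk

df = pd.read_pickle('large_file.pkl')  # later runs skip CSV parsing entirely
subset = df[df['some_column'] > 0]     # query in memory; 'some_column' is a placeholder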
Upvotes: 0
Reputation: 189477
os.system completely relinquishes the control you have in Python. There is no way to access anything which happened in the subprocess after it has finished.
A better way to have some (but not sufficient) control over a subprocess is to use the Python subprocess module. This allows you to interact with the running process using signals and I/O, but there is still no way to affect the internals of a process unless it has a specific API for allowing you to do that. (Linux exposes some process internals in the /proc filesystem if you want to explore that.)
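To address the question about accessing the data afterwards: here is a minimal sketch that captures the subprocess output in Python instead of discarding it (assuming fn holds the file name, as in the question):

import subprocess

result = subprocess.run(['cat', fn], capture_output=True, text=True)
data = result.stdout  # the file contents are now available in Python

Note that this just moves the reading into a child process; a plain open(fn).read() in Python would be simpler and no slower.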
I don't think you understand what the benchmark means. The cat >/dev/null
is a baseline which simply measures how quickly the system is able to read the file off the disk; your process cannot possibly be faster than the I/O channel permits, so this is the time that the system takes for doing nothing at all. You would basically subtract this time from the subsequent results before you compare their relative performance.
Conventionally, the absolutely fastest way to read a large file is to index it, then use the in-memory index to seek to the position inside the file you want to access. Building the index causes some overhead, but if you access the file more than once, the benefits soon outweigh the overhead. Importing your file into a database is a convenient and friendly way to do this; the database encapsulates the I/O completely and lets you query the data as if you could ignore that it is somehow serialized into bytes on a disk behind the scenes.
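To illustrate the indexing idea (a bare-bones sketch, not the database approach itself; fn is the file name from the question): record the byte offset of each line once, then seek straight to any row later without rescanning the whole file.

offsets = []
with open(fn, 'rb') as f:
    pos = 0
    for line in f:
        offsets.append(pos)
        pos += len(line)

with open(fn, 'rb') as f:
    f.seek(offsets[1000])  # jump directly to, e.g., row 1000
    row = f.readline()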
Upvotes: 3