Neeraj Hanumante

Reputation: 1684

How to quantify the reading progress of large CSV files read through pd.read_csv with chunks?

Analogy/Example

Let's say I have a list:

test_list = [2, 5, 3, 6]
number_of_elements = len(test_list)

Then enumerate can be used with number_of_elements to track the progress of a loop as follows:

for j, element in enumerate(test_list, start=1):
    # do something with element
    print('completed {} out of {}'.format(j, number_of_elements))

Question

Large CSV files can be read in chunks as shown below (reference answer):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

How to track the progress of this loop?

Attempt

file_chunks = pd.read_csv(file_name, chunksize=100000)
number_of_chunks = len(file_chunks)
for j, chunk in enumerate(pd.read_csv(file_name, chunksize=100000)):
    print(j, number_of_chunks)

This raises the following error:

TypeError: object of type 'TextFileReader' has no len()

Upvotes: 0

Views: 194

Answers (1)

Myccha

Reputation: 1018

You almost have it; the only problem is that there is no easy way for len to know how many chunks the file will yield before it has actually been read.

If you did:

file_chunks = pd.read_csv(file_name, chunksize=100000)

for i, chunk in enumerate(file_chunks):
    print(i)

That would work.
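If you do want a total to report progress against, one option is a cheap first pass over the file to count its rows and derive the number of chunks from that. This is a minimal sketch, assuming the CSV has a single header line, file_name points at your file, and process is your per-chunk function:

import pandas as pd

chunksize = 100000

# Count data rows in one cheap pass (subtract 1 for the header line).
with open(file_name) as f:
    number_of_rows = sum(1 for _ in f) - 1

# Ceiling division to get the number of chunks.
number_of_chunks = -(-number_of_rows // chunksize)

for j, chunk in enumerate(pd.read_csv(file_name, chunksize=chunksize), start=1):
    process(chunk)
    print('completed {} out of {}'.format(j, number_of_chunks))

The extra pass only scans the file line by line without parsing it, so it is usually much faster than the actual read_csv work.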

Also, this is a great use case for Dask (a Python library that mirrors much of the pandas API for files that are too big for memory).
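A minimal sketch of that idea, assuming process accepts and returns a pandas DataFrame and file_name points at the CSV; Dask's built-in ProgressBar takes care of the progress reporting:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Dask reads the CSV lazily in partitions (conceptually like pandas chunks).
ddf = dd.read_csv(file_name)

# Apply the per-chunk processing to each partition and show progress while computing.
with ProgressBar():
    result = ddf.map_partitions(process).compute()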

Upvotes: 1
