Reputation: 191
I have 60 HUGE csv files (around 2.5 GB each). Each covers data for one month and has a 'distance' column I am interested in. Each has around 14 million rows.
I need to find the average distance for each month.
This is what I have so far:
import pandas as pd
for x in range(1, 60):
    df = pd.read_csv(r'x.csv', error_bad_lines=False, chunksize=100000)
    for chunk in df:
        print df["distance"].mean()
First, I know 'print' is not a good idea; I need to assign the mean to a variable, I guess. Second, what I need is the average for the whole dataframe, not just for each chunk.
But I don't know how to do that. I was thinking of taking the average of each chunk and then the simple average of all the chunk averages. That should give me the average for the whole dataframe, as long as every chunk has the same number of rows.
Third, I need to do this for all 60 csv files. Is my looping correct in the code above? My files are named 1.csv to 60.csv.
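Edit: to make the second point concrete, this is the kind of thing I had in mind for a single file (untested sketch, assuming the column really is called 'distance'):

import pandas as pd

# Simple average of the chunk means for one file.
# Only exact if every chunk has the same number of rows.
chunk_means = []
for chunk in pd.read_csv('1.csv', error_bad_lines=False, chunksize=100000):
    chunk_means.append(chunk['distance'].mean())
print(sum(chunk_means) / len(chunk_means))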
Upvotes: 1
Views: 2685
Reputation: 71
I am presuming that the datasets are too large to load into memory as a pandas dataframe. If that is the case, consider using a generator on each csv file, something similar to: Where to use yield in Python best?
As the overall result you are after is an average, you can accumulate a running total of the distance values row by row while keeping a count of how many rows you have seen.
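A rough sketch of that idea (my assumptions: the files are named 1.csv to 60.csv as you say, the column header is exactly 'distance', and rows that fail to parse can be skipped, similar to error_bad_lines=False):

import csv

def distances(path):
    """Yield the 'distance' value from each row of one csv file."""
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            try:
                yield float(row['distance'])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or malformed distance

for x in range(1, 61):
    total = 0.0
    count = 0
    for d in distances(str(x) + '.csv'):
        total += d
        count += 1
    print(x, total / count)  # average distance for month x

Because the generator yields one value at a time, only a single row is ever in memory.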
Upvotes: 0
Reputation: 46
A few things I would fix, based on how your files are named. I presume your files are named like "1.csv", "2.csv", and so on. Also remember that range's upper bound is exclusive, so you need range(1, 61) to cover 60.csv.
import pandas as pd

distance_array = []
for x in range(1, 61):
    # with chunksize set, read_csv returns an iterator of DataFrames
    reader = pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000)
    for chunk in reader:
        for index, row in chunk.iterrows():
            distance_array.append(row['distance'])
print(sum(distance_array) / len(distance_array))
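One caveat, assuming the sizes from the question (60 files of about 14 million rows each): distance_array will end up holding roughly 840 million values, which will likely exhaust memory, and iterrows is slow. A running sum and count per chunk (a sketch, giving the same overall average) keeps memory constant:

import pandas as pd

total = 0.0
count = 0
for x in range(1, 61):
    for chunk in pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000):
        col = chunk['distance'].dropna()  # ignore missing values
        total += col.sum()
        count += len(col)
print(total / count)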
Upvotes: 3