Aakash Sharma

Reputation: 67

Reading a 20 GB CSV file in Python

I am trying to read a 20 GB file in Python from a remote path. The code below reads the file in chunks, but if the connection to the remote path is lost for any reason, I have to restart the entire reading process. Is there a way to continue from the last row I read and keep appending to the list I am trying to build? Here is my code:

import pandas as pd
from tqdm import tqdm

chunksize = 100000

df_list = []  # list to hold the chunk dataframes

for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    df_list.append(df_chunk)

train_df = pd.concat(df_list)

Upvotes: 0

Views: 1057

Answers (1)

9000

Reputation: 40894

Do you have much more than 20 GB of RAM? Because you're reading the entire file into RAM and representing it as Python objects. That df_list.append(df_chunk) is the culprit.

What you need to do is:

  • read it by smaller pieces (you already do);
  • process it piece by piece;
  • discard the old piece after processing. Python's garbage collection will do it for you unless you keep a reference to the spent chunk, as you currently do in df_list.

Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
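For illustration, here is a minimal sketch of that pattern, reusing the pathtofile variable from the question; the column names ('some_key', 'some_value') and the per-key running sum are made-up placeholders for whatever processing you actually need:

import pandas as pd
from tqdm import tqdm

chunksize = 100000
running_totals = None  # small per-key summary kept in RAM instead of the raw rows

for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    # process the chunk and keep only the (small) result
    partial = df_chunk.groupby('some_key')['some_value'].sum()  # hypothetical columns
    if running_totals is None:
        running_totals = partial
    else:
        running_totals = running_totals.add(partial, fill_value=0)
    # no reference to df_chunk is kept, so each chunk can be garbage-collected
    # on the next iteration and memory stays bounded

print(running_totals)

The only thing that grows here is running_totals, which has one entry per distinct key rather than one row per input line.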

Or get 64 GB / 128 GB of RAM, if that is quicker for you. Sometimes just throwing more resources at a problem is the faster solution.

Upvotes: 1
