Reputation: 67
I am trying to read a 20 GB file in Python from a remote path. The code below reads the file in chunks, but if the connection to the remote path is lost for any reason, I have to restart the entire read. Is there a way I can continue from the last row I read and keep appending to the list I am building? Here is my code:
import pandas as pd
from tqdm import tqdm

chunksize = 100000
df_list = []  # list to hold the batch dataframes
for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    df_list.append(df_chunk)
train_df = pd.concat(df_list)
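Ideally I would like something along these lines (just a sketch of what I am imagining, using pandas' skiprows parameter; rows_done.txt is a checkpoint file name I made up):

import os
import pandas as pd
from tqdm import tqdm

chunksize = 100000
checkpoint = 'rows_done.txt'  # hypothetical local file remembering progress

# Resume point: number of data rows already consumed in a previous run.
rows_done = 0
if os.path.exists(checkpoint):
    with open(checkpoint) as f:
        rows_done = int(f.read().strip())

df_list = []
reader = pd.read_csv(
    pathtofile,
    chunksize=chunksize,
    engine='python',
    # Skip the data rows already read, but keep line 0 (the header).
    skiprows=range(1, rows_done + 1),
)
for df_chunk in tqdm(reader):
    df_list.append(df_chunk)
    rows_done += len(df_chunk)
    # Persist progress so a dropped connection does not force a full restart.
    with open(checkpoint, 'w') as f:
        f.write(str(rows_done))

Would something like this work, or is there a better way?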
Upvotes: 0
Views: 1057
Reputation: 40894
Do you have much more than 20 GB of RAM? Because you're reading the entire file into RAM and representing it as Python objects. That df_list.append(df_chunk) is the culprit.
What you need to do is process each chunk as you read it and then discard it, instead of appending it to df_list. Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
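For example, something along these lines (a rough sketch; the per-column sums here are just a stand-in for whatever summary you actually need per chunk):

import pandas as pd
from tqdm import tqdm

chunksize = 100000
totals = None  # running per-column sums: the only thing kept in RAM

for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    # Reduce the chunk to the summary you care about, then let it be garbage-collected.
    chunk_sums = df_chunk.sum(numeric_only=True)
    totals = chunk_sums if totals is None else totals.add(chunk_sums, fill_value=0)

This way peak memory usage stays around one chunk plus the summary, no matter how large the file is.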
Or get 64 GB / 128 GB of RAM, if that is the quicker route for you. Sometimes just throwing more resources at a problem is faster.
Upvotes: 1