Aakash Sharma

Reputation: 67

Reading a 20 GB CSV file in Python

I am trying to read a 20 GB file in Python from a remote path. The code below reads the file in chunks, but if the connection to the remote path is lost for any reason, I have to restart the entire reading process. Is there a way to continue from the last row I read and keep appending to the list I am trying to build? Here is my code:

import pandas as pd
from tqdm import tqdm

chunksize = 100000

df_list = []  # list to hold the chunk dataframes

for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    df_list.append(df_chunk)

train_df = pd.concat(df_list)

Upvotes: 0

Views: 1057

Answers (1)

9000

Reputation: 40894

Do you have much more than 20 GB of RAM? Because you're reading the entire file into RAM and representing it as Python objects. That df_list.append(df_chunk) is the culprit.

What you need to do is:

  • read it by smaller pieces (you already do);
  • process it piece by piece;
  • discard the old piece after processing. Python's garbage collection will do it for you unless you keep a reference to the spent chunk, as you currently do in df_list.

Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
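For illustration, here is a minimal sketch of that pattern, reusing the pathtofile variable from the question; the column names ('some_key', 'some_value') and the per-key running sum are made-up placeholders for whatever processing you actually need:

import pandas as pd
from tqdm import tqdm

chunksize = 100000
running_totals = None  # small per-key summary kept in RAM instead of the raw rows

for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    # process the chunk and keep only the (small) result
    partial = df_chunk.groupby('some_key')['some_value'].sum()  # hypothetical columns
    if running_totals is None:
        running_totals = partial
    else:
        running_totals = running_totals.add(partial, fill_value=0)
    # no reference to df_chunk is kept, so each chunk can be garbage-collected
    # on the next iteration and memory stays bounded

print(running_totals)

The only thing that grows here is running_totals, which has one entry per distinct key rather than one row per input line.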

Or get 64 GB / 128 GB of RAM, if that is quicker for you. Sometimes just throwing more resources at a problem is the faster solution.

Upvotes: 1
