Reputation: 526
I am reading a large CSV file (25 GB) into a pandas.DataFrame. My PC specifications are:
Reading this file takes a long time, sometimes around 20 minutes. Is there any recommendation, code-wise, so I can do this better?
*Note: this DataFrame is needed in whole, since I am going to join (merge) it with another one.
Upvotes: 1
Views: 2258
Reputation: 51
You could use a dask.dataframe, which reads and processes the CSV lazily and in parallel:
import dask.dataframe as dd

df = dd.read_csv('filename.csv')  # builds a lazy dataframe; nothing is loaded into memory yet
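Since the whole DataFrame is only needed for a merge, dask can keep the merge lazy as well and only materialize the final result. A minimal sketch, assuming the second table is in 'other.csv' and both tables share a key column 'id' (file name and key are placeholders):

import dask.dataframe as dd

left = dd.read_csv('filename.csv')   # lazy read of the large CSV
right = dd.read_csv('other.csv')     # lazy read of the second table (placeholder name)

merged = left.merge(right, on='id', how='inner')  # the merge is lazy too; 'id' is a placeholder key

result = merged.compute()  # triggers the actual read and merge, returning a pandas.DataFrame

This way the file is processed partition by partition, and a full in-memory pandas DataFrame only exists for the merged result.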
Or you could use chunking:
import pandas as pd

def chunk_processing(chunk):       # define a function to apply to each chunk
    ## Do Something                # your per-chunk processing code here
    return chunk

chunk_list = []                    # empty list to collect the processed chunks
chunksize = 10 ** 6                # number of rows per chunk

for chunk in pd.read_csv('filename.csv', chunksize=chunksize):  # read the csv in chunks of chunksize rows
    processed_chunk = chunk_processing(chunk)                    # process each chunk
    chunk_list.append(processed_chunk)                           # collect the processed chunk

df_concat = pd.concat(chunk_list)  # concatenate the processed chunks into one DataFrame
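Once the chunks are concatenated, df_concat is a regular pandas DataFrame, so the merge from the question works as usual. A minimal sketch continuing from the code above, assuming the second table fits in memory and both tables share a key column 'id' (file name and key are placeholders):

other_df = pd.read_csv('other.csv')                       # the second table (placeholder name)
merged = df_concat.merge(other_df, on='id', how='inner')  # 'id' is a placeholder key column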
Upvotes: 1