HOON

Reputation: 33

How can I read a large number of files with Pandas?

Number of files: 894; total file size: 22.2 GB. I have to do machine learning by reading many CSV files, but there is not enough memory to read them all at once.

Upvotes: 1

Views: 366

Answers (2)

Esraa Abdelmaksoud

Reputation: 1689

You can read your files in chunks with pandas, but chunking does not carry over to the training phase unless you pick an algorithm that supports incremental learning, so you have to select an appropriate algorithm for your data. However, having such big files for model training usually means you should do some data preparation first (e.g. dropping unused columns and downcasting dtypes), which can reduce the size of the data significantly.
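
A minimal sketch of chunked reading plus per-chunk preparation with pandas (the file pattern, chunk size, and column names are assumptions for illustration):

import glob
import pandas as pd

prepared = []
for path in sorted(glob.glob('file-*.csv')):  # assumed file name pattern
    # chunksize makes read_csv return an iterator of DataFrames
    for chunk in pd.read_csv(path, chunksize=100_000):
        # Hypothetical preparation: keep only the needed columns and
        # downcast to float32 to shrink the in-memory footprint.
        chunk = chunk[['feature_a', 'feature_b', 'target']].astype('float32')
        prepared.append(chunk)

df = pd.concat(prepared, ignore_index=True)

Each chunk is reduced before it is kept, so peak memory stays close to one raw chunk plus the shrunken result.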

Upvotes: 1

SultanOrazbayev

Reputation: 16561

Specifically, to load a large number of files that do not fit in memory, one can use dask:

import dask.dataframe as dd

# One lazy dataframe over all matching files; nothing is read yet
df = dd.read_csv('file-*.csv')

This creates a lazy version of the data, meaning the data is loaded only when requested; e.g. df.head() reads just the first partition to return the first 5 rows. Where possible, pandas syntax applies to dask dataframes.
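
For example, building on the df above (the column name total is an assumption):

# Lazy: this only builds a task graph, no file is read yet
mean_total = df['total'].mean()

# Eager: streams over all files partition by partition and returns a float
print(mean_total.compute())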

For machine learning you can use dask-ml, which has tight integration with sklearn; see the docs.
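
A hedged sketch of out-of-core training via dask-ml's Incremental wrapper, assuming a binary target column named label and two numeric feature columns (all names are placeholders):

import dask.dataframe as dd
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

df = dd.read_csv('file-*.csv')
X = df[['feature_a', 'feature_b']].to_dask_array(lengths=True)  # assumed features
y = df['label'].to_dask_array(lengths=True)                     # assumed target

# Incremental calls the estimator's partial_fit once per dask block,
# so the full 22.2 GB never has to sit in memory at once.
clf = Incremental(SGDClassifier())
clf.fit(X, y, classes=[0, 1])  # partial_fit needs the classes up front

Any sklearn estimator that implements partial_fit can be wrapped this way.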

Upvotes: 2
