eagle23

Reputation: 35

Loading large dataset into Pandas Python

I would like to load a large .csv file (3.4M rows, 206K users) from InstaCart's open-sourced dataset: https://www.instacart.com/datasets/grocery-shopping-2017

Basically, I have trouble loading orders.csv into a Pandas DataFrame. I would like to learn best practices for loading large files into Pandas/Python.

Upvotes: 1

Views: 1774

Answers (3)

Aseem Bansal

Reputation: 6962

Depending on your machine, you may be able to read all of it into memory by specifying the data types while reading the CSV file. When pandas reads a CSV, the default data types it infers are often not the most compact ones. Using dtype you can specify the data types explicitly, which reduces the size of the DataFrame read into memory.
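
For example, a minimal sketch, assuming the column names of the InstaCart orders.csv (verify them against your own file):

import pandas as pd

# Compact dtypes for the orders.csv columns; the pandas defaults
# (int64/float64/object) use far more memory than these values need.
dtypes = {
    "order_id": "int32",
    "user_id": "int32",
    "eval_set": "category",
    "order_number": "int16",
    "order_dow": "int8",
    "order_hour_of_day": "int8",
    "days_since_prior_order": "float32",  # contains NaNs, so it must stay a float
}

df = pd.read_csv("orders.csv", dtype=dtypes)
print(df.memory_usage(deep=True).sum() / 2**20, "MiB")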

Upvotes: 0

Noufal Ibrahim

Reputation: 72735

When you have large data frames that might not fit in memory, dask is quite useful. The dask project page has examples of how you can create a dask DataFrame, which has the same API as the pandas one but can be distributed.
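
A minimal sketch of that approach, assuming the user_id and order_id columns from the InstaCart orders.csv:

import dask.dataframe as dd

# dask splits the csv into partitions and loads them lazily
df = dd.read_csv("orders.csv")

# same pandas-style API; nothing runs until .compute() is called
orders_per_user = df.groupby("user_id")["order_id"].count().compute()
print(orders_per_user.head())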

Upvotes: 0

Chankey Pathak

Reputation: 21666

The best option would be to read the data in chunks instead of loading the whole file into memory.

Luckily, the read_csv method accepts a chunksize argument:

import pandas as pd

for chunk in pd.read_csv("file.csv", chunksize=10000):
    process(chunk)  # process() is your own per-chunk handler

Note: By specifying a chunksize to read_csv or read_table, the return value will be an iterable object of type TextFileReader.
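
As a sketch of that pattern, here is an incremental reduction over the chunks, assuming the user_id column from the InstaCart orders.csv:

import pandas as pd

# TextFileReader is an iterator, so results can be reduced chunk by chunk;
# here: orders per user, without the full file ever being in memory.
counts = None
for chunk in pd.read_csv("orders.csv", chunksize=100_000):
    part = chunk.groupby("user_id").size()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head())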


Upvotes: 3