Reputation: 35
I would like to load the large .csv dataset (3.4M rows, 206k users) open-sourced by InstaCart: https://www.instacart.com/datasets/grocery-shopping-2017
Basically, I'm having trouble loading orders.csv into a Pandas DataFrame. I would like to learn best practices for loading large files into Pandas/Python.
Upvotes: 1
Views: 1774
Reputation: 6962
Depending on your machine, you may be able to read all of it into memory by specifying the data types while reading the csv file. When pandas reads a csv, the default data types it infers may not be the most memory-efficient ones. With the dtype
argument you can specify the data types explicitly, which reduces the size of the data frame held in memory.
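As a rough sketch, something like the following; the column names and dtypes below assume the InstaCart orders.csv schema, so adjust them to whatever your file actually contains:

import pandas as pd

# Smaller integer/category dtypes instead of the default int64/object.
# These column names are an assumption based on the InstaCart orders.csv.
dtypes = {
    "order_id": "int32",
    "user_id": "int32",
    "eval_set": "category",
    "order_number": "int16",
    "order_dow": "int8",
    "order_hour_of_day": "int8",
    "days_since_prior_order": "float32",  # column has NaNs, so it cannot be an integer dtype
}

orders = pd.read_csv("orders.csv", dtype=dtypes)
print(orders.memory_usage(deep=True).sum() / 1e6, "MB")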
Upvotes: 0
Reputation: 72735
When you have large data frames that might not fit in memory, dask is quite useful. The main page I've linked to has examples on how you can create a dask dataframe that has the same API as the pandas one but which can be distributed.
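For example, a minimal sketch (assuming the file is called orders.csv and has a user_id column):

import dask.dataframe as dd

# dask reads the csv lazily in partitions; the API mirrors pandas
orders = dd.read_csv("orders.csv")

# computations are deferred until .compute() is called
orders_per_user = orders.groupby("user_id")["order_id"].count().compute()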
Upvotes: 0
Reputation: 21666
The best option would be to read the data in chunks instead of loading the whole file into memory.
Luckily, the read_csv method accepts a chunksize argument.
import pandas as pd

for chunk in pd.read_csv("file.csv", chunksize=somesize):
    process(chunk)
Note: By specifying a chunksize to read_csv or read_table, the return value will be an iterable object of type TextFileReader.
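For instance, a sketch that counts orders per user without ever holding the full file in memory; the column names here are an assumption based on the InstaCart orders.csv:

import pandas as pd

counts = None
for chunk in pd.read_csv("orders.csv", chunksize=100_000):
    # aggregate each chunk, then merge the partial results
    partial = chunk.groupby("user_id")["order_id"].count()
    counts = partial if counts is None else counts.add(partial, fill_value=0)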
Upvotes: 3