Rohit Khanna

Reputation: 57

Pandas read_csv large file Performance improvement

I was wondering whether there is a way to improve the performance of reading large csv files into a pandas dataframe. I have three large pipe-delimited files (3.5MM records each) which I want to load into dataframes and perform some tasks on. Currently I am using pandas.read_csv(), defining the columns and their datatypes in the parameters as below. I did see some improvement by defining the datatypes of the columns, but it still takes more than 3 minutes to load.

import pandas as pd
df = pd.read_csv(file_, index_col=None, usecols=sourceFields, sep='|', header=0,
                 dtype={'date': 'str', 'gwTimeUtc': 'str', 'asset': 'str',
                        'instrumentId': 'str', 'askPrice': 'float64', 'bidPrice': 'float64',
                        'askQuantity': 'float64', 'bidQuantity': 'float64', 'currency': 'str',
                        'venue': 'str', 'owner': 'str', 'status': 'str',
                        'priceNotation': 'str', 'nominalQuantity': 'float64'})
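To see how much the dtype hint is worth on your own data, you can time the read directly. A minimal sketch, using an in-memory synthetic file with made-up values (only a few of the column names above are reused; the timing numbers will differ from real files on disk):

```python
import io
import time
import pandas as pd

# Synthetic pipe-delimited data standing in for the real file.
rows = "\n".join(f"2021-01-0{i % 9 + 1}|{100.0 + i}|{99.5 + i}" for i in range(10_000))
csv_text = "date|askPrice|bidPrice\n" + rows

dtypes = {"date": "str", "askPrice": "float64", "bidPrice": "float64"}

# With explicit dtypes, pandas can skip per-column type inference.
t0 = time.perf_counter()
df = pd.read_csv(io.StringIO(csv_text), sep="|", dtype=dtypes)
elapsed = time.perf_counter() - t0

print(len(df))                # 10000
print(df["askPrice"].dtype)   # float64
```

Running the same read twice, once with `dtype=` and once without, gives a rough before/after comparison on your actual columns.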

Upvotes: 1

Views: 2471

Answers (2)

Nathan Dai

Reputation: 522

The .feather file is significantly faster than .csv. Pandas has built-in support for feather files.

Read the csv once with pd.read_csv(path), then export it with df.to_feather(path) — note that to_feather is a DataFrame method, not a top-level pandas function. From then on, read the feather file with pd.read_feather(path) instead of the csv.

In my case, a 950 MB csv file was compressed to a 180 MB feather file. Instead of taking 30 seconds to read, it takes about 1 second. I know I am a bit late to the party, but feather files are seriously underrated.

Upvotes: 1

jpp

Reputation: 164803

Depending on what you wish to do with the data, a good option is dask.dataframe. This library works out-of-core (on datasets larger than memory) and allows you to perform a subset of pandas operations lazily. You can then bring the results into memory as a pandas dataframe. Below is example code you can try:

import dask.dataframe as dd, pandas as pd

# point to all files beginning with "file"
dask_df = dd.read_csv('file*.csv')

# define your calculations as you would in pandas
dask_df['col2'] = dask_df['col1'] * 2

# compute results & return to pandas
df = dask_df.compute()

Crucially, nothing significant is computed until the very last line.
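If installing dask is not an option, a similar streaming pattern can be sketched with pandas' own chunksize parameter, which yields the file in pieces instead of loading it whole. This is a plain-pandas alternative, not the dask approach above, and it uses an in-memory synthetic file with made-up column names:

```python
import io
import pandas as pd

# Synthetic single-column csv standing in for a large file.
csv_text = "col1\n" + "\n".join(str(i) for i in range(1000))

# Stream the file in 250-row chunks, applying the per-chunk
# transform as each piece arrives, then stitch the results.
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    chunk["col2"] = chunk["col1"] * 2
    parts.append(chunk)

df = pd.concat(parts, ignore_index=True)
print(len(df))               # 1000
print(df["col2"].iloc[-1])   # 1998
```

Unlike dask, this keeps only one chunk plus the accumulated results in memory at a time, and nothing is parallelized — but it avoids any extra dependency.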

Upvotes: 2
