Reputation: 2701
I'm serving API requests with fairly tight latency requirements, and the data I want to transform is posted one row at a time. I was surprised to see that the pandas read_csv method takes around 2 ms, which I can't afford to spend just loading the data.
Are there further improvements possible on the code below, such as an argument I'm missing which would speed things up with this size of data?
from io import StringIO
import pandas as pd
import numpy as np
example_input = '1969,EH10,consumer'
The best-optimised pandas call I could find uses the following arguments:
%%timeit
s = StringIO(example_input)
df = pd.read_csv(s,
                 sep=',',
                 header=None,
                 engine='c',
                 names=['dob', 'postcode', 'contract'],
                 dtype=str,
                 compression=None,
                 na_filter=False,
                 low_memory=False)
which locally returns
1.75 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I was able to get considerable speedup loading with numpy and then creating a dataframe:
%%timeit
s = StringIO(example_input)
a = np.genfromtxt(s, delimiter=',', dtype=str)
df = pd.DataFrame(a.reshape(1, -1),
                  columns=['dob', 'postcode', 'contract'])
which gives 415 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
which is more acceptable for my application. (Loading into just the numpy array, without the DataFrame step, takes ~70.4 µs, so I may end up working with that.)
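For reference, the numpy-only measurement is the same genfromtxt call as above, just without the DataFrame construction:

%%timeit
s = StringIO(example_input)
# parse the single row straight into a numpy string array
a = np.genfromtxt(s, delimiter=',', dtype=str)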
However, is it possible to speed up the pd.read_csv example further? And if not, can anyone help me understand the reasons behind the big delta here?
Upvotes: 4
Views: 328
Reputation: 231665
Normally we see that pd.read_csv is faster than genfromtxt. But evidently it has a startup cost, which dominates in this one-row case.
In [95]: example_input = '1969,EH10,consumer'
In [96]: np.genfromtxt([example_input], delimiter=',',dtype=str)
Out[96]: array(['1969', 'EH10', 'consumer'], dtype='<U8')
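One way to check the startup-cost claim (a sketch; I haven't shown the numbers): time the same read_csv call on one row and on 1000 copies of the row. If the fixed overhead dominates, the two timings should be of the same order.

# rebuild the StringIO inside the timed expression so each run
# parses a fresh stream rather than an exhausted one
timeit pd.read_csv(StringIO(example_input), header=None, names=['dob', 'postcode', 'contract'])
timeit pd.read_csv(StringIO('\n'.join([example_input] * 1000)), header=None, names=['dob', 'postcode', 'contract'])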
But why not just split the string and make an array from that? It's more direct and much faster:
In [97]: np.array(example_input.split(','))
Out[97]: array(['1969', 'EH10', 'consumer'], dtype='<U8')
Making the dataframe from this array is what takes the time:
In [106]: timeit np.array(example_input.split(','))
2.89 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [107]: timeit pd.DataFrame(np.array(example_input.split(','))[None, :],
     ...:                     columns=['dob', 'postcode', 'contract'])
406 µs ± 6.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
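If you do need the DataFrame, one variant worth timing (a sketch, not benchmarked here) skips numpy entirely and hands the split list straight to the constructor; since the DataFrame construction itself dominates the 406 µs above, any win will be modest:

# build the one-row frame directly from the split list;
# pd.DataFrame accepts a list of rows, so no intermediate array is needed
row = example_input.split(',')
df = pd.DataFrame([row], columns=['dob', 'postcode', 'contract'])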
Upvotes: 3