Reputation: 2701
I'm serving API requests with fairly tight latency requirements, and the data I want to transform is posted one row at a time. I was surprised to see that the pandas read_csv method takes around 2 ms, which I can't afford to spend just loading the data.
Are there further improvements possible on the code below, such as an argument I'm missing which would speed things up with this size of data?
from io import StringIO
import pandas as pd
import numpy as np
example_input = '1969,EH10,consumer'
The best-optimised pandas call I could find uses the following arguments:
%%timeit
s = StringIO(example_input)
df = pd.read_csv(s,
                 sep=',',
                 header=None,
                 engine='c',
                 names=['dob', 'postcode', 'contract'],
                 dtype=str,
                 compression=None,
                 na_filter=False,
                 low_memory=False)
which locally returns
1.75 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I was able to get considerable speedup loading with numpy and then creating a dataframe:
%%timeit
s = StringIO(example_input)
a = np.genfromtxt(s, delimiter=',', dtype=str)
df = pd.DataFrame(a.reshape(1, -1),
                  columns=['dob', 'postcode', 'contract'])
which gives 415 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
which is more acceptable for my application. (Loading into just the numpy array, without the DataFrame step, takes ~70.4 µs, so I may end up working with that.)
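For reference, the numpy-only measurement is the same genfromtxt call as above, just without the DataFrame construction:

%%timeit
s = StringIO(example_input)
# parse the single row straight into a numpy string array
a = np.genfromtxt(s, delimiter=',', dtype=str)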
However, is it possible to speed up the pd.read_csv example further? And if not, can anyone help me understand the reasons behind the big delta here?
Upvotes: 4
Views: 328
Reputation: 231665
Normally we see that pd.read_csv is faster than genfromtxt. But evidently it has a startup cost, which dominates in this one-row case.
In [95]: example_input = '1969,EH10,consumer'
In [96]: np.genfromtxt([example_input], delimiter=',',dtype=str)
Out[96]: array(['1969', 'EH10', 'consumer'], dtype='<U8')
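One way to check the startup-cost claim (a sketch; I haven't shown the numbers): time the same read_csv call on one row and on 1000 copies of the row. If the fixed overhead dominates, the two timings should be of the same order.

# rebuild the StringIO inside the timed expression so each run
# parses a fresh stream rather than an exhausted one
timeit pd.read_csv(StringIO(example_input), header=None, names=['dob', 'postcode', 'contract'])
timeit pd.read_csv(StringIO('\n'.join([example_input] * 1000)), header=None, names=['dob', 'postcode', 'contract'])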
But why not just split the string and make an array from that? It's more direct and much faster:
In [97]: np.array(example_input.split(','))
Out[97]: array(['1969', 'EH10', 'consumer'], dtype='<U8')
Making the dataframe from this array is what takes the time:
In [106]: timeit np.array(example_input.split(','))
2.89 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [107]: timeit pd.DataFrame(np.array(example_input.split(','))[None, :],
     ...:                     columns=['dob', 'postcode', 'contract'])
406 µs ± 6.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
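If you do need the DataFrame, one variant worth timing (a sketch, not benchmarked here) skips numpy entirely and hands the split list straight to the constructor; since the DataFrame construction itself dominates the 406 µs above, any win will be modest:

# build the one-row frame directly from the split list;
# pd.DataFrame accepts a list of rows, so no intermediate array is needed
row = example_input.split(',')
df = pd.DataFrame([row], columns=['dob', 'postcode', 'contract'])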
Upvotes: 3