philE
philE

Reputation: 1703

Pandas: speed up df.loc based on repeat index values

I have the pandas DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

from which I would like to select rows based on values of the index (x) in the selection array

selection = ['a', 'c', 'b', 'b', 'c', 'a']

The correct output can be obtained by using df.loc as follows

out = df.loc[selection]

The problem I am running in to is df.loc is running pretty slow on large DataFrames (2-7 million rows). Is there a way to speed up this operation? I've looked into eval(), but it doesn't seem to apply to hard-coded lists of index values like this. I have also thought about using pd.DataFrame.isin, but that misses the repeat values (only returns a row per unique element in selection).

Upvotes: 6

Views: 2420

Answers (2)

Alex Riley
Alex Riley

Reputation: 176820

You can get a decent speedup by using reindex instead of loc:

df.reindex(selection)

Timings (version 0.17.0):

>>> selection2 = selection * 100 # a larger list of labels
>>> %timeit df.loc[selection2]
100 loops, best of 3: 2.54 ms per loop

>>> %timeit df.reindex(selection2)
1000 loops, best of 3: 833 µs per loop

The two methods take different paths (hence the speed difference).

loc builds the new DataFrame by calling down to get_indexer_non_unique which is necessarily more complex than the simple get_indexer (used for unique values).

On the other hand, the hard work in reindex appears to be done by the take_* functions in generated.pyx. These functions appear to be faster for the purpose of constructing the new DataFrame.

Upvotes: 5

Colonel Beauvel
Colonel Beauvel

Reputation: 31171

You can try merge:

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
})

df1 = pd.DataFrame({'x':selection})

In [21]: pd.merge(df1,df,on='x', how='left')
Out[21]: 
   x  y  z
0  a  1  f
1  c  2  s
2  b  2  s
3  b  2  s
4  c  2  s
5  a  1  f

Upvotes: 3

Related Questions