Efficiently find index of DataFrame values in array

Question

I have a DataFrame that resembles:

x     y     z
--------------
0     A     10
0     D     13
1     X     20
...

and I have two sorted arrays holding every possible value of x and y:

x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]

so I wrote a function:

def lookup(record, lookup_list, lookup_attr):
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))

and then call:

df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)

# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]

but is there a more performant way to do this? And is it possible to handle multiple columns at once, so that I get a DataFrame back rather than a Series?

I tried:

np.where(np.in1d(x_values, df.x))[0]

but this removes duplicate values and that is not desired.

cs95 · Accepted Answer

You can convert your index arrays to pd.Index objects to make lookup fast(er).

u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})

   x  y
0  0  1
1  0  2
2  1  3

Where,

x_values
# [0, 1]

y_values
# ['a', 'A', 'D', 'X']
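
For reference, here is a self-contained sketch of the same idea that can be run as-is. The sample df is rebuilt from the three rows shown in the question, and the short x_values / y_values lists are the ones assumed above, not the asker's full arrays. The name `missing` is only there for illustration.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question.
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})

# Assumed (shortened) lookup arrays, as listed above.
x_values = [0, 1]
y_values = ['a', 'A', 'D', 'X']

# Wrap each lookup array in a pd.Index so positions are resolved in one call.
u, v = map(pd.Index, [x_values, y_values])

result = pd.DataFrame({'x': u.get_indexer(df['x']),
                       'y': v.get_indexer(df['y'])})

# Values that are absent from the Index come back as -1.
missing = v.get_indexer(['Q'])
```

Two behaviours are worth keeping in mind. Unlike np.searchsorted, get_indexer does not return an insertion point for a value it cannot find; it returns -1. It also expects the lookup array to contain unique values, so duplicates in x_values or y_values would need to be dropped first.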

As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.

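val_list = [x_values, y_values]  # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)
# zip pairs each Index with the DataFrame's columns in order ('x', 'y', ...),
# so val_list must follow the column order of df.
pd.DataFrame({
    c: idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})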

   x  y
0  0  1
1  0  2
2  1  3
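
If relying on column order feels fragile, one possible variation is to pass the lookup arrays keyed by column name. This is only a sketch: the helper name `index_positions` and its `values_by_column` parameter are hypothetical, and the sample data is again rebuilt from the question.

```python
import pandas as pd


def index_positions(df, values_by_column):
    """Return a DataFrame holding, for each listed column, the position of
    every value within that column's lookup array (-1 if not found)."""
    return pd.DataFrame(
        {col: pd.Index(values).get_indexer(df[col])
         for col, values in values_by_column.items()},
        index=df.index,  # keep the original row labels
    )


# Sample data rebuilt from the rows shown in the question.
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})

positions = index_positions(df, {'x': [0, 1], 'y': ['a', 'A', 'D', 'X']})
```

Columns that are not named in the mapping (z here) are simply left out of the result, and the explicit keys mean the order of the lookup arrays no longer has to match the order of the DataFrame's columns.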
