Reputation: 5188
I have a DataFrame that resembles:
x y z
--------------
0 A 10
0 D 13
1 X 20
...
and I have two sorted arrays holding every possible value for x and y:
x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]
so I wrote a function:
def lookup(record, lookup_list, lookup_attr):
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))
and then call:
df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)
# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]
But is there a more performant way to do this? Ideally it would handle multiple columns at once and return a DataFrame rather than a Series.
I tried:
np.where(np.in1d(x_values, df.x))[0]
but this removes duplicate values and that is not desired.
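For reference, np.searchsorted accepts a whole array of needles, so the row-wise apply is not strictly needed. The following is a minimal sketch only, using a small stand-in DataFrame built from the sample rows above and assuming x_values is sorted ascending. Note that searchsorted returns insertion points, so it does not check that a value is actually present, and for strings it relies on code-point order, in which uppercase letters sort before lowercase ones.

```python
import numpy as np
import pandas as pd

# Stand-in data built from the sample rows in the question (assumption).
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
x_values = np.array([0, 1])  # must be sorted ascending for searchsorted

# One vectorised call handles the whole column, duplicates included.
x_idx = np.searchsorted(x_values, df['x'].to_numpy())
print(x_idx.tolist())
```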
Upvotes: 1
Views: 404
Reputation: 323226
Update: build a Series that maps each value to its position and look the column up with .loc; you could also try reindex.
pd.Series(range(len(x_values)),index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
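The reindex variant mentioned above is not shown, so here is a rough sketch of what it might look like. The DataFrame is a stand-in built from the question's sample rows, and the name positions is introduced only for illustration. The practical difference, as far as I know, is that reindex yields NaN for values missing from x_values, whereas .loc raises a KeyError.

```python
import pandas as pd

# Stand-in data built from the sample rows in the question (assumption).
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
x_values = [0, 1]

# Map each possible value to its position in x_values.
# "positions" is a hypothetical name used only in this sketch.
positions = pd.Series(range(len(x_values)), index=x_values)

# reindex looks up every row's value; the lookup index must be unique,
# but the values being looked up may repeat.
x_idx = positions.reindex(df['x']).tolist()
print(x_idx)
```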
Upvotes: 2
Reputation: 402413
You can convert your index arrays to pd.Index objects to make lookup fast(er).
u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})
x y
0 0 1
1 0 2
2 1 3
Where,
x_values
# [0, 1]
y_values
# ['a', 'A', 'D', 'X']
As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.
val_list = [x_values, y_values] # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)
pd.DataFrame({
    f'{c}': idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})
x y
0 0 1
1 0 2
2 1 3
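Putting the pieces together, here is a self-contained sketch of the same idea that can be pasted and run. The DataFrame is reconstructed from the question's sample rows, and the dict name lookups is my own (hypothetical); keying the value lists by column name avoids depending on column order. One thing to keep in mind is that get_indexer returns -1 for values that are not in the index, rather than raising.

```python
import pandas as pd

# Stand-in data built from the sample rows in the question (assumption).
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})

# Hypothetical mapping of column name -> list of all possible values.
lookups = {'x': [0, 1], 'y': ['a', 'A', 'D', 'X']}

# Build one pd.Index per column and look up all rows in a single call each.
result = pd.DataFrame(
    {col: pd.Index(vals).get_indexer(df[col]) for col, vals in lookups.items()},
    index=df.index,
)
print(result)

# Values absent from the index come back as -1.
missing = pd.Index(lookups['x']).get_indexer([5])
print(missing.tolist())
```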
Upvotes: 4