Reputation: 5191
This is a simple problem that I can't seem to find an elegant solution to. I am trying to select the rows of a data frame where two of the columns form a pair from a separate list.
For example:
import pandas as pd
df = pd.DataFrame({'a': range(8), 'b': range(8), 'c': list('zyxwvuts')})
pairs = [(4, 4), (5, 6), (6, 6), (7, 9)]
# The data has an arbitrary number of columns, but I just want
# to match 'a' and 'b'
df
a b c
0 0 0 z
1 1 1 y
2 2 2 x
3 3 3 w
4 4 4 v
5 5 5 u
6 6 6 t
7 7 7 s
In this example, my list pairs
contains the combination of df.a
and df.b
at rows 4 and 6. What I would like is a clean way to get the data frame given by df.iloc[[4, 6], :]
.
Is there a pandas
or numpy
way to do this without explicitly looping over pairs
?
The solution using broadcasting is both clean and fast, as well as scaling very well.
def with_set_index(df, pairs):
return df.set_index(['a','b']).loc[pairs].dropna()
def with_tuple_isin(df, pairs):
return df[df[['a','b']].apply(tuple,1).isin(pairs)]
def with_array_views(df, pairs):
def view1D(a, b): # a, b are arrays
a = np.ascontiguousarray(a)
b = np.ascontiguousarray(b)
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
return a.view(void_dt).ravel(), b.view(void_dt).ravel()
A, B = view1D(df[['a','b']].values, np.asarray(pairs))
return df[np.isin(A, B)]
def with_broadcasting(df, pairs):
return df[(df[['a','b']].values[:,None] == pairs).all(2).any(1)]
%timeit with_set_index(df, pairs)
# 7.35 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit with_tuple_isin(df, pairs)
# 1.89 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit with_array_views(df, pairs)
# 917 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit with_broadcasting(df, pairs)
# 879 µs ± 8.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 2
Views: 2003
Reputation: 221544
A vectorized one based on array-views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
a = np.ascontiguousarray(a)
b = np.ascontiguousarray(b)
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
return a.view(void_dt).ravel(), b.view(void_dt).ravel()
A,B = view1D(df[['a','b']].values,np.asarray(pairs))
out = df[np.isin(A,B)]
Output for given sample -
In [263]: out
Out[263]:
a b c
4 4 4 v
6 6 6 t
If you are looking for a compact/clean version, we can also leverage broadcasting
-
In [269]: df[(df[['a','b']].values[:,None] == pairs).all(2).any(1)]
Out[269]:
a b c
4 4 4 v
6 6 6 t
Upvotes: 2
Reputation: 323226
tuple
with isin
df[df[['a','b']].apply(tuple,1).isin(pairs)]
Out[686]:
a b c
4 4 4 v
6 6 6 t
Upvotes: 6