C. Braun
C. Braun

Reputation: 5191

Select where a subset of columns in a pandas DataFrame match a tuple

This is a simple problem that I can't seem to find an elegant solution to. I am trying to select the rows of a data frame where two of the columns form a pair from a separate list.

For example:

import pandas as pd

df = pd.DataFrame({'a': range(8), 'b': range(8), 'c': list('zyxwvuts')})
pairs = [(4, 4), (5, 6), (6, 6), (7, 9)]

# The data has an arbitrary number of columns, but I just want
# to match 'a' and 'b'
df
    a   b   c
0   0   0   z
1   1   1   y
2   2   2   x
3   3   3   w
4   4   4   v
5   5   5   u
6   6   6   t
7   7   7   s

In this example, my list pairs contains the combination of df.a and df.b at rows 4 and 6. What I would like is a clean way to get the data frame given by df.iloc[[4, 6], :].

Is there a pandas or numpy way to do this without explicitly looping over pairs?


Answer comparison

The solution using broadcasting is both clean and fast, as well as scaling very well.

def with_set_index(df, pairs):
    return df.set_index(['a','b']).loc[pairs].dropna()

def with_tuple_isin(df, pairs):
    return df[df[['a','b']].apply(tuple,1).isin(pairs)]

def with_array_views(df, pairs):
    def view1D(a, b): # a, b are arrays
        a = np.ascontiguousarray(a)
        b = np.ascontiguousarray(b)
        void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
        return a.view(void_dt).ravel(), b.view(void_dt).ravel()

    A, B = view1D(df[['a','b']].values, np.asarray(pairs))
    return df[np.isin(A, B)]

def with_broadcasting(df, pairs):
    return df[(df[['a','b']].values[:,None] == pairs).all(2).any(1)]

%timeit with_set_index(df, pairs)
# 7.35 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit with_tuple_isin(df, pairs)
# 1.89 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit with_array_views(df, pairs)
# 917 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit with_broadcasting(df, pairs)
# 879 µs ± 8.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 2

Views: 2003

Answers (3)

Rajat Jain
Rajat Jain

Reputation: 2032

Try this:

df.set_index(['a','b']).loc[pairs].dropna()

Upvotes: 1

Divakar
Divakar

Reputation: 221544

A vectorized one based on array-views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

A,B = view1D(df[['a','b']].values,np.asarray(pairs))
out = df[np.isin(A,B)]

Output for given sample -

In [263]: out
Out[263]: 
   a  b  c
4  4  4  v
6  6  6  t

If you are looking for a compact/clean version, we can also leverage broadcasting -

In [269]: df[(df[['a','b']].values[:,None] == pairs).all(2).any(1)]
Out[269]: 
   a  b  c
4  4  4  v
6  6  6  t

Upvotes: 2

BENY
BENY

Reputation: 323226

tuple with isin

df[df[['a','b']].apply(tuple,1).isin(pairs)]
Out[686]: 
   a  b  c
4  4  4  v
6  6  6  t

Upvotes: 6

Related Questions