Ganitph
Ganitph

Reputation: 123

Efficiently search for combinations of list values in a pandas DataFrame

I'm iterating over multiple pandas data frames where on each iteration I extract 2 lists. Then, for each 2 lists combination, I need to find the row in another Data Frame where each combination of all values in the 2 lists appears.

Simplified example:

The df looks like this (any D_val P_val may appear only once):

D_val | P_val | D_PCode  
A     | 1     | 25  
D     | 1     | 21  
C     | 2     | 32  
B     | 4     | 35  
D     | 2     | 126  
B     | 1     | 3  
C     | 1     | 312  

For a single iteration with the following lists -

list1 = [1,2]  
list2 = [A,B,C]  

I expect to get the following list:
[25, 3, 312, 32]

Explanation:

for [A,1] - 25  
for [A,2] - nothing  
For [B,1] - 3  

etc.

I implemented it using nested loops, but with multiple iterations, large data frame and long lists it takes far too long.

Any suggestions how to efficiently implement it? I'm happy to use other data structures if needed.

Upvotes: 3

Views: 1408

Answers (3)

Dani Mesejo
Dani Mesejo

Reputation: 61910

You can use itertools.product to generate all possible values, then use isin:

from itertools import product

import pandas as pd

data = [['A', 1, 25],
        ['D', 1, 21],
        ['C', 2, 32],
        ['B', 4, 35],
        ['D', 2, 126],
        ['B', 1, 3],
        ['C', 1, 312]]

df = pd.DataFrame(data=data, columns=['D_val', 'P_val', 'D_PCode'])

list1 = [1, 2]
list2 = ['A', 'B', 'C']
lookup = set(product(list2, list1))

mask = df[['D_val', 'P_val']].apply(tuple, axis=1).isin(lookup)
result = df[mask].D_PCode

print(result)

Output

0     25
2     32
5      3
6    312
Name: D_PCode, dtype: int64

Or just use isin directly on both columns, for instance:

list1 = [1, 2]
list2 = ['A', 'B', 'C']

result = df[df.D_val.isin(list2) & df.P_val.isin(list1)].D_PCode
print(result)

Output

0     25
2     32
5      3
6    312
Name: D_PCode, dtype: int64

Upvotes: 2

cs95
cs95

Reputation: 402553

This is a MultiIndex problem, and is best solved with reindex.

df = df.set_index(['D_val', 'P_val'])
idx = pd.MultiIndex.from_product([list2, list1])

df.reindex(idx)

     D_PCode
A 1     25.0
  2      NaN
B 1      3.0
  2      NaN
C 1    312.0
  2     32.0

If you want a dictionary of the matches, use reindex + Series.to_dict:

df.reindex(idx).D_PCode.to_dict()

{('A', 1): 25.0,
 ('A', 2): nan,
 ('B', 1): 3.0,
 ('B', 2): nan,
 ('C', 1): 312.0,
 ('C', 2): 32.0}

If you only want the rows that exist, use intersection for performantly computating the index of the intersection:

df.loc[df.index.intersection(idx)]

     D_PCode
A 1       25
B 1        3
C 1      312
  2       32

Upvotes: 1

BENY
BENY

Reputation: 323276

We just using isin twice

df1=df.loc[df.D_val.isin(list2)&df.P_val.isin(list1)]
df1
Out[211]: 
  D_val  P_val  D_PCode
0     A      1       25
2     C      2       32
5     B      1        3
6     C      1      312

Upvotes: 2

Related Questions