Reputation: 123
I'm iterating over multiple pandas data frames where on each iteration I extract 2 lists. Then, for each 2 lists combination, I need to find the row in another Data Frame where each combination of all values in the 2 lists appears.
Simplified example:
The df looks like this (any D_val P_val may appear only once):
D_val | P_val | D_PCode
A | 1 | 25
D | 1 | 21
C | 2 | 32
B | 4 | 35
D | 2 | 126
B | 1 | 3
C | 1 | 312
For a single iteration with the following lists -
list1 = [1,2]
list2 = [A,B,C]
I expect to get the following list:
[25, 3, 312, 32]
Explanation:
for [A,1] - 25
for [A,2] - nothing
For [B,1] - 3
etc.
I implemented it using nested loops, but with multiple iterations, large data frame and long lists it takes far too long.
Any suggestions how to efficiently implement it? I'm happy to use other data structures if needed.
Upvotes: 3
Views: 1408
Reputation: 61910
You can use itertools.product to generate all possible values, then use isin:
from itertools import product
import pandas as pd
data = [['A', 1, 25],
['D', 1, 21],
['C', 2, 32],
['B', 4, 35],
['D', 2, 126],
['B', 1, 3],
['C', 1, 312]]
df = pd.DataFrame(data=data, columns=['D_val', 'P_val', 'D_PCode'])
list1 = [1, 2]
list2 = ['A', 'B', 'C']
lookup = set(product(list2, list1))
mask = df[['D_val', 'P_val']].apply(tuple, axis=1).isin(lookup)
result = df[mask].D_PCode
print(result)
Output
0 25
2 32
5 3
6 312
Name: D_PCode, dtype: int64
Or just use isin directly on both columns, for instance:
list1 = [1, 2]
list2 = ['A', 'B', 'C']
result = df[df.D_val.isin(list2) & df.P_val.isin(list1)].D_PCode
print(result)
Output
0 25
2 32
5 3
6 312
Name: D_PCode, dtype: int64
Upvotes: 2
Reputation: 402553
This is a MultiIndex
problem, and is best solved with reindex
.
df = df.set_index(['D_val', 'P_val'])
idx = pd.MultiIndex.from_product([list2, list1])
df.reindex(idx)
D_PCode
A 1 25.0
2 NaN
B 1 3.0
2 NaN
C 1 312.0
2 32.0
If you want a dictionary of the matches, use reindex
+ Series.to_dict
:
df.reindex(idx).D_PCode.to_dict()
{('A', 1): 25.0,
('A', 2): nan,
('B', 1): 3.0,
('B', 2): nan,
('C', 1): 312.0,
('C', 2): 32.0}
If you only want the rows that exist, use intersection
for performantly computating the index of the intersection:
df.loc[df.index.intersection(idx)]
D_PCode
A 1 25
B 1 3
C 1 312
2 32
Upvotes: 1
Reputation: 323276
We just using isin
twice
df1=df.loc[df.D_val.isin(list2)&df.P_val.isin(list1)]
df1
Out[211]:
D_val P_val D_PCode
0 A 1 25
2 C 2 32
5 B 1 3
6 C 1 312
Upvotes: 2