Reputation: 843
I have two dataframes. Each row in Dataframe A is a package of products, and Dataframe B consists of product ids and their sellers' ids.
Dataframe A:
package_name | product_1 | product_2 | product_3 | product_4
package a | 12 | 15 | NaN | NaN
package b | 17 | 16 | 14 | NaN
package c | 12 | 11 | 17 | 19
Dataframe B:
product_id | seller_id
12 | seller1
15 | seller1
12 | seller2
15 | seller2
17 | seller3
16 | seller3
14 | seller3
(Each product can have multiple sellers, and each seller has multiple products.)
I want to know which sellers have products of packages (based on Dataframe A). This is what expected:
Dataframe C:
package_name | product_1 | product_2 | product_3 | product_4 | seller_id
package a | 12 | 15 | NaN | NaN | seller1
package a | 12 | 15 | NaN | NaN | seller2
package b | 17 | 16 | 14 | NaN | seller3
Both seller1 and seller2 have "all" products of package a, and seller3 has "all" products of package b.
How can I achieve Dataframe C?
Upvotes: 1
Views: 460
Reputation: 863751
Idea is use DataFrame.merge
with right join by helper DataFrame created by match subset of sets:
print (B)
product_id seller_id
0 12 seller1
1 15 seller1
2 12 seller2
3 15 seller2
4 17 seller3
5 16 seller3
6 14 seller3
7 12 seller4
8 15 seller4
9 14 seller4
A = A.set_index('package_name')
f = lambda x: set([int(y) for y in x if y == y])
a = A.apply(f, axis=1).to_dict()
#print (a)
b = B.groupby('seller_id')['product_id'].apply(set).to_dict()
#print (b)
c = [(k, k1) for k, v in a.items() for k1,v1 in b.items() if v.issubset(v1)]
#print (c)
C1 = pd.DataFrame(c, columns=['package_name','seller_id'])
print (C1)
package_name seller_id
0 package a seller1
1 package a seller2
2 package a seller4
3 package b seller3
C = A.merge(C1, on='package_name', how='right')
print (C)
package_name product_1 product_2 product_3 product_4 seller_id
0 package a 12 15 NaN NaN seller1
1 package a 12 15 NaN NaN seller2
2 package a 12 15 NaN NaN seller4
3 package b 17 16 14.0 NaN seller3
Upvotes: 2