Saeed Esmaili
Saeed Esmaili

Reputation: 843

Getting a subset of dataframe based on values of another dataframe (Python)

I have two dataframes. Each row in Dataframe A is a package of products, and Dataframe B consists of product ids and their sellers' ids.

Dataframe A:

package_name | product_1 | product_2 | product_3 | product_4
package a    |     12    |     15    |    NaN    |    NaN
package b    |     17    |     16    |    14     |    NaN
package c    |     12    |     11    |    17     |    19

Dataframe B:

product_id | seller_id
12         | seller1
15         | seller1
12         | seller2
15         | seller2
17         | seller3
16         | seller3
14         | seller3

(Each product can have multiple sellers, and each seller has multiple products.)

I want to know which sellers have products of packages (based on Dataframe A). This is what expected:

Dataframe C:

package_name | product_1 | product_2 | product_3 | product_4 | seller_id
package a    |     12    |     15    |    NaN    |    NaN    | seller1
package a    |     12    |     15    |    NaN    |    NaN    | seller2
package b    |     17    |     16    |    14     |    NaN    | seller3

Both seller1 and seller2 have "all" products of package a, and seller3 has "all" products of package b.

How can I achieve Dataframe C?

Upvotes: 1

Views: 460

Answers (1)

jezrael
jezrael

Reputation: 863751

Idea is use DataFrame.merge with right join by helper DataFrame created by match subset of sets:

print (B)
   product_id seller_id
0          12   seller1
1          15   seller1
2          12   seller2
3          15   seller2
4          17   seller3
5          16   seller3
6          14   seller3
7          12   seller4
8          15   seller4
9          14   seller4

A = A.set_index('package_name') 
f = lambda x: set([int(y) for y in x if y == y])
a = A.apply(f, axis=1).to_dict()
#print (a)

b = B.groupby('seller_id')['product_id'].apply(set).to_dict()
#print (b)

c = [(k, k1) for k, v in a.items() for k1,v1 in b.items() if v.issubset(v1)]
#print (c)

C1 = pd.DataFrame(c, columns=['package_name','seller_id'])
print (C1)
  package_name seller_id
0    package a   seller1
1    package a   seller2
2    package a   seller4
3    package b   seller3

C = A.merge(C1, on='package_name', how='right')
print (C)
  package_name  product_1  product_2  product_3  product_4 seller_id
0    package a         12         15        NaN        NaN   seller1
1    package a         12         15        NaN        NaN   seller2
2    package a         12         15        NaN        NaN   seller4
3    package b         17         16       14.0        NaN   seller3

Upvotes: 2

Related Questions