TomKunes
TomKunes

Reputation: 37

Figuring out intersection of two different data frames in pandas

I have two data frames. One includes the sales for Product A. The other has the sales for Product B.

I'd like to figure out which Customers only bought Product A, which customers only bought Product B, and which customers bought both.

DF 1:

Cust_ID   Industry    Product_Type
ABC       Manuf       Product A
ACB       Tech        Product A
BAD       Manuf       Product A
DAC       Tech        Product A
DEF       Retail      Product A

DF 2:

Cust_ID   Industry    Product_Type
DAC       Prof Serv   Product B
BAD       Manuf       Product B
BCA       Tech        Product B
FED       Trade       Product B
DEF       Retail      Product B

I'd like the output to be:

BothDF:

Cust_ID   Industry    
BAD       Manuf
DEF       Retail

Only_Prod_A_DF:

Cust_ID   Industry
ABC       Manuf
ACB       Tech
DAC       Tech

Only_Prod_B_DF:

Cust_ID   Industry
DAC       Prof Serv
BCA       Tech
FED       Trade

I've figured out the intersection between the two data frames with the following code:

Sales = pd.merge(Sales_Prod_A, Sales_Prod_B, how='inner', on=['Cust_ID'])

Any thoughts on the customers who bought one or the other? Thank you!

Upvotes: 1

Views: 44

Answers (1)

jezrael
jezrael

Reputation: 863801

Add indicator=True and suffixes parameters to merge in first step:

Sales = pd.merge(Sales_Prod_A, 
                 Sales_Prod_B, 
                 how='outer', 
                 on=['Cust_ID'], 
                 indicator=True,
                 suffixes=('','_'))
print (Sales)
  Cust_ID Industry Product_Type  Industry_ Product_Type_      _merge
0     ABC    Manuf    Product A        NaN           NaN   left_only
1     ACB     Tech    Product A        NaN           NaN   left_only
2     BAD    Manuf    Product A      Manuf     Product B        both
3     DAC     Tech    Product A  Prof Serv     Product B        both
4     DEF   Retail    Product A     Retail     Product B        both
5     BCA      NaN          NaN       Tech     Product B  right_only
6     FED      NaN          NaN      Trade     Product B  right_only

And then filter by DataFrame.loc with boolean indexing and Series.eq, last rename columns names for remove _x and _y:

BothDF = Sales.loc[Sales['_merge'].eq('both'), ['Cust_ID','Industry']]
print (BothDF)
  Cust_ID Industry
2     BAD    Manuf
3     DAC     Tech
4     DEF   Retail

Only_Prod_A_DF = Sales.loc[Sales['_merge'].eq('left_only'), ['Cust_ID','Industry']]
print (Only_Prod_A_DF)
  Cust_ID Industry
0     ABC    Manuf
1     ACB     Tech


Only_Prod_B_DF = (Sales.loc[Sales['_merge'].eq('right_only'), ['Cust_ID','Industry_']]
                       .rename(columns=lambda x: x.strip('_')))
print (Only_Prod_B_DF)
  Cust_ID Industry
5     BCA     Tech
6     FED    Trade

Upvotes: 4

Related Questions