Get rows with same value combination in seperate columns using pandas

Question

I am trying to perform entity matching for the first time and want to "get rid" of the obvious matches first, so I can focus working with the fuzzy cases. I have a dataset of almost 600.000 entries containing information about clothes.

What I need is all different prices of the suppliers that have the same id, color and size.

Here is an example:

df = pd.DataFrame([{'product_id': 1, 'price': 10, 'supplier': 'A', 'color': "red", "size": "xl"},
                       {'product_id': 2, 'price': 7, 'supplier': 'A', 'color': "blue", "size": "m"},
                       {'product_id': 2, 'price': 7, 'supplier': 'B', 'color': "blue", "size": "m"},
                       {'product_id': 1, 'price': 11, 'supplier': 'C', 'color': "red", "size": "xl"},
                       {'product_id': 3, 'price': 2, 'supplier': 'C', 'color': "red", "size": "s"},
                       {'product_id': 1, 'price': 3, 'supplier': 'A', 'color': "blue", "size": "XL"}]) #EDIT: added

which looks like:

   product_id  price supplier color size
0           1     10        A   red   xl
1           2      7        A  blue    m
2           2      7        B  blue    m
3           1     11        C   red   xl
4           3      2        C   red    s
EDIT:
5           1      3        A  blue   xl

EDIT Remark: The old example missed a key aspect: There are products that have the same product_id but can have multiple color variations. Think about product 1 being a standart white T-shirt that I can order in different sizes and colors from multiple suppliers. I also realized that I don't need the supplier, see below:

but what I need is:

   product_id   color  size  price_a   price_b  price_c
0           1     red    xl    10        -        11
1           1    blue    xl     3        -        -
2           2    blue     m     7        7        -
3           3     red     s     -        -        2

I know I have to groupby "product_id", "color" and "size" in combination with the aggregate function. But I dont know how I can make pandas create the new columns price_a, price_b and price_c.

I feel like this should be very simple, but I can not get it to work. Any help is much appreciated!

Hamza usman ghani · Accepted Answer

Following steps worked like a magic to achieve the output:

Make groups w.r.t product_id and color using groupby() and get first item of color, size and list of values of price and supplier.
explode the price and supplier columns using pd.Series.explode()
Pivot the table using DatFrame.pivot() ; product_id and color as index, supplier as columns and price as values.
merge the pivoted table with initial table using pd.concat() and drop the unnecessary columns using DataFrame.drop()

df = df.groupby(['product_id','color']).agg({'size':'first','price':list,'supplier':list}).reset_index()
price_details = df.set_index(['product_id',"color","size"]).apply(pd.Series.explode).reset_index()
price_details =pd.pivot_table(price_details,index=["product_id","color"], columns="supplier", values="price",aggfunc= 'first').add_prefix("price_").fillna("-").reset_index()
out = pd.concat([df,price_details],axis=1)
out.drop(columns = ["price"], inplace=True)
out.supplier = out.supplier.apply(lambda x: x[0])
print(out)

   product_id color size supplier  product_id color price_A price_B price_C
0           1  blue   XL        A           1  blue       3       -       -
1           1   red   xl        A           1   red      10       -      11
2           2  blue    m        A           2  blue       7       7       -
3           3   red    s        C           3   red       -       -       2

Get rows with same value combination in seperate columns using pandas

Answers (2)

Related Questions