mayool
mayool

Reputation: 148

Get rows with same value combination in seperate columns using pandas

I am trying to perform entity matching for the first time and want to "get rid" of the obvious matches first, so I can focus working with the fuzzy cases. I have a dataset of almost 600.000 entries containing information about clothes.

What I need is all different prices of the suppliers that have the same id, color and size.

Here is an example:

df = pd.DataFrame([{'product_id': 1, 'price': 10, 'supplier': 'A', 'color': "red", "size": "xl"},
                       {'product_id': 2, 'price': 7, 'supplier': 'A', 'color': "blue", "size": "m"},
                       {'product_id': 2, 'price': 7, 'supplier': 'B', 'color': "blue", "size": "m"},
                       {'product_id': 1, 'price': 11, 'supplier': 'C', 'color': "red", "size": "xl"},
                       {'product_id': 3, 'price': 2, 'supplier': 'C', 'color': "red", "size": "s"},
                       {'product_id': 1, 'price': 3, 'supplier': 'A', 'color': "blue", "size": "XL"}]) #EDIT: added

which looks like:

   product_id  price supplier color size
0           1     10        A   red   xl
1           2      7        A  blue    m
2           2      7        B  blue    m
3           1     11        C   red   xl
4           3      2        C   red    s
EDIT:
5           1      3        A  blue   xl

EDIT Remark: The old example missed a key aspect: There are products that have the same product_id but can have multiple color variations. Think about product 1 being a standart white T-shirt that I can order in different sizes and colors from multiple suppliers. I also realized that I don't need the supplier, see below:

but what I need is:

   product_id   color  size  price_a   price_b  price_c
0           1     red    xl    10        -        11
1           1    blue    xl     3        -        -
2           2    blue     m     7        7        -
3           3     red     s     -        -        2

I know I have to groupby "product_id", "color" and "size" in combination with the aggregate function. But I dont know how I can make pandas create the new columns price_a, price_b and price_c.

I feel like this should be very simple, but I can not get it to work. Any help is much appreciated!

Upvotes: 2

Views: 894

Answers (2)

Hamza usman ghani
Hamza usman ghani

Reputation: 2243

Following steps worked like a magic to achieve the output:

  1. Make groups w.r.t product_id and color using groupby() and get first item of color, size and list of values of price and supplier.
  2. explode the price and supplier columns using pd.Series.explode()
  3. Pivot the table using DatFrame.pivot() ; product_id and color as index, supplier as columns and price as values.
  4. merge the pivoted table with initial table using pd.concat() and drop the unnecessary columns using DataFrame.drop()
df = df.groupby(['product_id','color']).agg({'size':'first','price':list,'supplier':list}).reset_index()
price_details = df.set_index(['product_id',"color","size"]).apply(pd.Series.explode).reset_index()
price_details =pd.pivot_table(price_details,index=["product_id","color"], columns="supplier", values="price",aggfunc= 'first').add_prefix("price_").fillna("-").reset_index()
out = pd.concat([df,price_details],axis=1)
out.drop(columns = ["price"], inplace=True)
out.supplier = out.supplier.apply(lambda x: x[0])
print(out)

   product_id color size supplier  product_id color price_A price_B price_C
0           1  blue   XL        A           1  blue       3       -       -
1           1   red   xl        A           1   red      10       -      11
2           2  blue    m        A           2  blue       7       7       -
3           3   red    s        C           3   red       -       -       2

Upvotes: 1

Arkadiusz
Arkadiusz

Reputation: 1875

My idea is two concatenate two dataframes - one dataframe without duplicates and dataframes where we have prices for each type. I believe it could be done by using fewer lines, but I will give you my solution as there are no other:

pd.concat([
    (
        df.drop_duplicates(subset=['product_id', 'color', 'size'], keep='first')
          .drop(columns='price')
          .reset_index(drop=True)
    )
    , 
    (
        df.groupby(['product_id', 'supplier'])['price']
          .apply(lambda x: list(x)[0])
          .to_frame()
          .unstack()
          .droplevel(0, axis=1)
          .reset_index()
          .drop('product_id', axis=1)
          .rename(columns={'A': 'price_a', 'B': 'price_b', 'C': 'price_c'})
     )
], 
    axis=1
).fillna('-')

Output:

    product_id  supplier    color   size    price_a  price_b    price_c
0            1         A      red     xl       10.0        -       11.0
1            2         A     blue      m        7.0      7.0          -
2            3         C      red      s          -        -        2.0

Upvotes: 1

Related Questions