Reputation: 148
I am trying to perform entity matching for the first time and want to "get rid" of the obvious matches first, so I can focus working with the fuzzy cases. I have a dataset of almost 600.000 entries containing information about clothes.
What I need is all different prices of the suppliers that have the same id, color and size.
Here is an example:
df = pd.DataFrame([{'product_id': 1, 'price': 10, 'supplier': 'A', 'color': "red", "size": "xl"},
{'product_id': 2, 'price': 7, 'supplier': 'A', 'color': "blue", "size": "m"},
{'product_id': 2, 'price': 7, 'supplier': 'B', 'color': "blue", "size": "m"},
{'product_id': 1, 'price': 11, 'supplier': 'C', 'color': "red", "size": "xl"},
{'product_id': 3, 'price': 2, 'supplier': 'C', 'color': "red", "size": "s"},
{'product_id': 1, 'price': 3, 'supplier': 'A', 'color': "blue", "size": "XL"}]) #EDIT: added
which looks like:
product_id price supplier color size
0 1 10 A red xl
1 2 7 A blue m
2 2 7 B blue m
3 1 11 C red xl
4 3 2 C red s
EDIT:
5 1 3 A blue xl
EDIT Remark: The old example missed a key aspect: There are products that have the same product_id but can have multiple color variations. Think about product 1 being a standart white T-shirt that I can order in different sizes and colors from multiple suppliers. I also realized that I don't need the supplier, see below:
but what I need is:
product_id color size price_a price_b price_c
0 1 red xl 10 - 11
1 1 blue xl 3 - -
2 2 blue m 7 7 -
3 3 red s - - 2
I know I have to groupby "product_id", "color" and "size" in combination with the aggregate function. But I dont know how I can make pandas create the new columns price_a, price_b and price_c.
I feel like this should be very simple, but I can not get it to work. Any help is much appreciated!
Upvotes: 2
Views: 894
Reputation: 2243
Following steps worked like a magic to achieve the output:
product_id
and color
using groupby()
and get first item of color, size and list of values of price and supplier.pd.Series.explode()
DatFrame.pivot()
; product_id
and color
as index, supplier
as columns and price
as values.pd.concat()
and drop the unnecessary columns using DataFrame.drop()
df = df.groupby(['product_id','color']).agg({'size':'first','price':list,'supplier':list}).reset_index()
price_details = df.set_index(['product_id',"color","size"]).apply(pd.Series.explode).reset_index()
price_details =pd.pivot_table(price_details,index=["product_id","color"], columns="supplier", values="price",aggfunc= 'first').add_prefix("price_").fillna("-").reset_index()
out = pd.concat([df,price_details],axis=1)
out.drop(columns = ["price"], inplace=True)
out.supplier = out.supplier.apply(lambda x: x[0])
print(out)
product_id color size supplier product_id color price_A price_B price_C
0 1 blue XL A 1 blue 3 - -
1 1 red xl A 1 red 10 - 11
2 2 blue m A 2 blue 7 7 -
3 3 red s C 3 red - - 2
Upvotes: 1
Reputation: 1875
My idea is two concatenate two dataframes - one dataframe without duplicates and dataframes where we have prices for each type. I believe it could be done by using fewer lines, but I will give you my solution as there are no other:
pd.concat([
(
df.drop_duplicates(subset=['product_id', 'color', 'size'], keep='first')
.drop(columns='price')
.reset_index(drop=True)
)
,
(
df.groupby(['product_id', 'supplier'])['price']
.apply(lambda x: list(x)[0])
.to_frame()
.unstack()
.droplevel(0, axis=1)
.reset_index()
.drop('product_id', axis=1)
.rename(columns={'A': 'price_a', 'B': 'price_b', 'C': 'price_c'})
)
],
axis=1
).fillna('-')
Output:
product_id supplier color size price_a price_b price_c
0 1 A red xl 10.0 - 11.0
1 2 A blue m 7.0 7.0 -
2 3 C red s - - 2.0
Upvotes: 1