Reputation: 449
I have two datasets of company names with naming variations between the two (ex: Merck & Co. vs Merck & Co vs Merck & Co., Inc.)
I am trying to create a "map" where all the unique variants of "Merck" match to "Merck & Co."
My current approach is to create a dictionary, convert to a dataframe, and ideally perform this transformation described below:
Company 1 Company 2 Company 3
0 Co1 variant 0 Co1 variant 0 Co3 variant 0
1 Co1 variant 1 Co2 variant 1
2 Co1 variant 2
Should become:
Company 1 Company 2 Company 3
Co1 variant 0 True True False
Co1 variant 1 True False False
Co1 variant 2 True False False
Co2 variant 0 False True False
Co2 variant 1 False True False
Co3 variant 0 False True True
From here if each row only has 1 True
value in the row, I would match this naming variant to the column header, otherwise I would match it to itself; allowing the final data set to look like
Variants Matched Name
0 Co1 variant 0 Co1 variant 0
1 Co1 variant 1 Company 1
2 Co1 variant 2 Company 1
3 Co2 variant 1 Company 2
4 Co3 variant 0 Company 3
Thank you in advance for any input, I'm also happy to clarify further if the question is not clear enough.
Upvotes: 1
Views: 158
Reputation: 215107
You can stack the original data frame and then group by the row id number and the variants, check how many rows you have for each group, if there are only one row, select the company otherwise select the variants itself:
(df.stack().rename("Variants").rename_axis(("RowId", "Company")).reset_index()
.groupby(["RowId", "Variants"])
.apply(lambda g: g.Company.iloc[0] if g.Company.size == 1 else g.Variants.iloc[0])
.rename("Matched Names").reset_index().drop("RowId", axis = 1))
Upvotes: 1