wingsoficarus116
wingsoficarus116

Reputation: 449

transform pandas dataframe index to be based on dataframe values, while changing values to booleans

I have two datasets of company names with naming variations between the two (ex: Merck & Co. vs Merck & Co vs Merck & Co., Inc.)

I am trying to create a "map" where all the unique variants of "Merck" match to "Merck & Co."

My current approach is to create a dictionary, convert to a dataframe, and ideally perform this transformation described below:

    Company 1        Company 2         Company 3
0   Co1 variant 0    Co1 variant 0     Co3 variant 0
1   Co1 variant 1    Co2 variant 1     
2   Co1 variant 2       

Should become:

                  Company 1       Company 2         Company 3
Co1 variant 0     True            True              False
Co1 variant 1     True            False             False
Co1 variant 2     True            False             False
Co2 variant 0     False           True              False
Co2 variant 1     False           True              False
Co3 variant 0     False           True              True

From here if each row only has 1 True value in the row, I would match this naming variant to the column header, otherwise I would match it to itself; allowing the final data set to look like

    Variants        Matched Name
0   Co1 variant 0   Co1 variant 0
1   Co1 variant 1   Company 1
2   Co1 variant 2   Company 1
3   Co2 variant 1   Company 2
4   Co3 variant 0   Company 3

Thank you in advance for any input, I'm also happy to clarify further if the question is not clear enough.

Upvotes: 1

Views: 158

Answers (1)

akuiper
akuiper

Reputation: 215107

You can stack the original data frame and then group by the row id number and the variants, check how many rows you have for each group, if there are only one row, select the company otherwise select the variants itself:

(df.stack().rename("Variants").rename_axis(("RowId", "Company")).reset_index()
 .groupby(["RowId", "Variants"])
 .apply(lambda g: g.Company.iloc[0] if g.Company.size == 1 else g.Variants.iloc[0])
 .rename("Matched Names").reset_index().drop("RowId", axis = 1))

enter image description here

Upvotes: 1

Related Questions