Reputation: 25
I have a large dataframe with a format like this:
| ID | A | B |
| -------- | ----------------- | ----------------------- |
| 0 | Tenure: Leasehold | Modern;Tenure: Leasehold|
| 1 | First Floor | Refurbished |
| 2 | NaN | Modern |
| 3 | First Floor | NaN |
I want to remove redundancies between columns A and B before merging them. So I would like to check if the value in column A is contained in column B: if yes, column A should take the value of column B, if no, the value of column A should remain the same.
I tried the following lambda function:
df['A'] = df['A'].apply(lambda x: df.B if df.A in df.B else df.A)
But I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Then I tried with the np.where method like this:
df['A'] = np.where((df.A.values in df.B.values), df.B, df.A)
I can run the code, but it returns false for all the columns, so I don't get any modification in my DataFrame.
If I run the following code, it returns True however, so I know that the problem does not come from the data:
df.loc[0, 'A'] in df.loc[0, 'B']
I tried to modify this code and use it like that:
df['A'] = np.where((df.loc[:, 'A'] in df.loc[:, 'B']), df.B, df.A)
But then I get the same TypeError as above.
How can I solve this issue?
Upvotes: 1
Views: 45
Reputation: 195408
df["A"] = df.apply(lambda x: x["B"] if x["A"] in x["B"] else x["A"], axis=1)
print(df)
Prints:
ID A B
0 0 Modern;Tenure: Leasehold Modern;Tenure: Leasehold
1 1 First Floor Refurbished
EDIT: To handle NaN
s:
df["A"] = df.apply(
lambda x: x["B"]
if pd.notna(x["A"]) and pd.notna(x["B"]) and x["A"] in x["B"]
else x["A"],
axis=1,
)
print(df)
Prints:
ID A B
0 0 Modern;Tenure: Leasehold Modern;Tenure: Leasehold
1 1 First Floor Refurbished
2 2 NaN Modern
3 3 First Floor NaN
If you want to fill NaN
s in column "A"
:
df.loc[df["A"].isna(), "A"] = df.loc[df["A"].isna(), "B"]
print(df)
Prints:
ID A B
0 0 Modern;Tenure: Leasehold Modern;Tenure: Leasehold
1 1 First Floor Refurbished
2 2 Modern Modern
3 3 First Floor NaN
Upvotes: 2
Reputation: 75080
I would do a list comprehension with zip which would be quite fast when compared to pandas apply as the size of the dataframe increases:
df["A"] = [b if a in b else a for a,b in zip(df['A'],df['B'])]
print(df)
ID A B
0 0 Modern;Tenure: Leasehold Modern;Tenure: Leasehold
1 1 First Floor Refurbished
Upvotes: 1