Reputation: 1293
I've got a pandas DataFrame whose columns are address fields. My problem is that, in some rows, two columns hold the same value. Does anybody know how I can conditionally change the value in one column when a duplicate is found across the two columns? Ideally I would like to keep one value and set the other to np.nan.
Here is a test case:
import pandas as pd
test = pd.read_json('{"housename":{"16":null,"17":null,"18":null},"name":{"16":"Shoecare","17":"33","18":"33A"},"house_number":{"16":"32","17":"33","18":"33A"},"street":{"16":"Carfax","17":"Carfax","18":"Carfax"},"city":{"16":"Horsham","17":"Horsham","18":"Horsham"},"postcode":{"16":"RH12 1EE","17":"RH12 1EE","18":"RH12 1EE"}}')
city house_number housename name postcode street
16 Horsham 32 NaN Shoecare RH12 1EE Carfax
17 Horsham 33 NaN 33 RH12 1EE Carfax
18 Horsham 33A NaN 33A RH12 1EE Carfax
On the test case, I've played with test.duplicated(subset=['house_number', 'name']), but it won't identify duplicate values across the house_number and name columns.
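For reference, duplicated() only flags rows whose whole (house_number, name) combination repeats an earlier row; it never compares the two columns within a single row, so on the test frame it returns all False:
test.duplicated(subset=['house_number', 'name'])
16    False
17    False
18    False
dtype: bool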
Does anybody have any suggestions for how to first identify duplicated cells across two columns, and then set one value to np.nan?
Desired output:
housename name house_number street city postcode
16 NaN Shoecare 32 Carfax Horsham RH12 1EE
17 NaN NaN 33 Carfax Horsham RH12 1EE
18 NaN NaN 33A Carfax Horsham RH12 1EE
Upvotes: 1
Views: 42
Reputation: 12417
If the two columns are house_number and name, you can do it this way:
import numpy as np
test['name'] = np.where(test['house_number'] == test['name'], np.nan, test['name'])
Output:
city house_number housename name postcode street
16 Horsham 32 NaN Shoecare RH12 1EE Carfax
17 Horsham 33 NaN NaN RH12 1EE Carfax
18 Horsham 33A NaN NaN RH12 1EE Carfax
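An equivalent one-liner, assuming the same test frame, is Series.mask, which replaces a value with NaN wherever the condition holds and avoids the explicit numpy import:
test['name'] = test['name'].mask(test['house_number'] == test['name'])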
Upvotes: 2