Reputation: 111
I want to remove the duplicate entries within each cell of a column and keep only the unique values. I tried set(df['unit'])
but it didn't help at all. I want to only have (New Ward, Adult Ward)/(Pediatric Ward)
for example.
Could you please show me how to fix this?
Upvotes: 2
Views: 145
Reputation: 5802
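Looks like your cells contain comma-separated strings. In that case, you can split each string on the regex
r',\s*'
(comma plus optional whitespace), turn each resulting list into a set, and join it back into a string:
df['unit'] = df.unit.str.split(r',\s*').apply(set).str.join(', ')
Note that a set does not preserve order, so the entries within a cell may come back in a different order.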
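If the original order of the entries matters, an order-preserving variant is possible. The following is only a minimal sketch: the sample data and the helper name dedupe_cell are made up for illustration, and it assumes every cell is a plain comma-separated string.

```python
import pandas as pd

# Hypothetical sample data; the real column contents are an assumption.
df = pd.DataFrame({
    "unit": [
        "New Ward, Adult Ward, New Ward",
        "Pediatric Ward, Pediatric Ward",
        "Adult Ward",
    ]
})

def dedupe_cell(cell: str) -> str:
    """Drop repeated entries from one comma-separated cell, keeping first-seen order."""
    # Split on commas and strip surrounding whitespace; skip empty pieces.
    parts = [p.strip() for p in cell.split(",") if p.strip()]
    # dict.fromkeys keeps insertion order, unlike set.
    return ", ".join(dict.fromkeys(parts))

df["unit"] = df["unit"].apply(dedupe_cell)
print(df["unit"].tolist())
# ['New Ward, Adult Ward', 'Pediatric Ward', 'Adult Ward']
```

This comparison is case-sensitive, so 'New Ward' and 'New WARD' would still count as different entries unless the case is normalized first.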
However, encoding values as comma-separated strings has downsides. For example, 'Adult Ward, New WARD'
and 'New WARD, Adult Ward'
would be two distinct values when in reality they are identical. Consider handling each ward as a separate column, e.g.,
for w in ['Pediatric Ward', 'New Ward', 'Adult Ward']:
    df[w] = df.unit.str.contains(w, case=False)
will result in a separate boolean column for each ward, which will be much easier to work with down the line, for example when looking for combinations of wards:
                   unit  Pediatric Ward  New Ward  Adult Ward
0        Pediatric Ward            True     False       False
1  Adult Ward, New WARD           False      True        True
2              New WARD           False      True       False
3        Pediatric Ward            True     False       False
4  Adult Ward, New WARD           False      True        True
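To illustrate what "getting combinations of wards" could look like with these boolean columns, here is a rough sketch. It rebuilds the sample frame from the table above; the variable names both and combo_counts are only illustrative, and regex=False is an added assumption so that the ward names are matched as literal substrings.

```python
import pandas as pd

# Sample data matching the table above.
df = pd.DataFrame({
    "unit": [
        "Pediatric Ward",
        "Adult Ward, New WARD",
        "New WARD",
        "Pediatric Ward",
        "Adult Ward, New WARD",
    ]
})

wards = ["Pediatric Ward", "New Ward", "Adult Ward"]
for w in wards:
    # Case-insensitive literal substring match, one boolean column per ward.
    df[w] = df["unit"].str.contains(w, case=False, regex=False)

# Rows that belong to both the New Ward and the Adult Ward.
both = df[df["New Ward"] & df["Adult Ward"]]
print(both.index.tolist())  # [1, 4]

# How often each combination of wards occurs.
combo_counts = df.groupby(wards).size()
print(combo_counts)
```

Because each ward is its own column, the order in which the wards were written in the original string no longer matters.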
Upvotes: 3