Nemra Khalil

Reputation: 111

How to remove duplicate entries within each cell of a column in pandas?

I have this dataframe (shown as an image in the original post).

I want to remove the duplicate entries within each cell of a column and keep only the unique values. I tried set(df['unit']), but it didn't help at all. For example, I want to end up with only (New Ward, Adult Ward)/(Pediatric Ward). Could you please show me how to fix this?

Upvotes: 2

Views: 145

Answers (1)

fsimonjetz

Reputation: 5802

Looks like your cells contain comma-separated strings. In that case, you can

  • split the entries at ',\s*' (comma plus optional whitespace),
  • convert them to sets,
  • and rejoin them to comma-delimited strings:
df['unit'] = df.unit.str.split(r',\s*').apply(set).str.join(', ')
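
Here is a minimal, self-contained sketch of the three steps above. The sample data is made up, since the original dataframe is only available as an image, and the sketch assumes every cell is a string with no missing values. It also uses sorted(set(...)) instead of a bare set so that the order of the rejoined values is deterministic; that is a choice of this sketch, not part of the one-liner above.

```python
import pandas as pd

# Hypothetical sample data; the real dataframe is not shown in the post.
df = pd.DataFrame({
    "unit": [
        "Pediatric Ward, Pediatric Ward",
        "New Ward, Adult Ward, New Ward, Adult Ward",
        "New Ward",
    ]
})

# Split each cell on a comma plus optional whitespace, drop repeats,
# and rejoin the unique values in sorted order.
df["unit"] = (
    df["unit"]
    .str.split(r",\s*")
    .apply(lambda parts: ", ".join(sorted(set(parts))))
)

print(df["unit"].tolist())
```

Note that the comparison is case-sensitive, so 'New Ward' and 'New WARD' would still count as two different values here.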

However, there are downsides to this way of encoding values. For example, 'Adult Ward, New WARD' and 'New WARD, Adult Ward' would be two distinct values when, in reality, they describe the same set of wards. Consider handling each ward as a separate column, e.g.,

for w in ['Pediatric Ward', 'New Ward', 'Adult Ward']:
    df[w] = df.unit.str.contains(w, case=False)

will result in separate columns for each ward, which will be much easier to work with down the line, like getting combinations of wards:

                   unit  Pediatric Ward  New Ward  Adult Ward
0        Pediatric Ward            True     False       False
1  Adult Ward, New WARD           False      True        True
2              New WARD           False      True       False
3        Pediatric Ward            True     False       False
4  Adult Ward, New WARD           False      True        True
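
As a rough illustration of why the boolean columns are easier to work with, the sketch below rebuilds the table above from its 'unit' column and then selects the rows that combine two wards. The variable names (wards, both_mask, rows_with_both) are made up for this example.

```python
import pandas as pd

# Rebuild the example table shown above from its 'unit' column.
df = pd.DataFrame({
    "unit": [
        "Pediatric Ward",
        "Adult Ward, New WARD",
        "New WARD",
        "Pediatric Ward",
        "Adult Ward, New WARD",
    ]
})

wards = ["Pediatric Ward", "New Ward", "Adult Ward"]
for w in wards:
    # Case-insensitive substring match, one boolean column per ward.
    df[w] = df["unit"].str.contains(w, case=False)

# Rows that belong to both 'New Ward' and 'Adult Ward'.
both_mask = df["New Ward"] & df["Adult Ward"]
rows_with_both = df.index[both_mask].tolist()

print(rows_with_both)
print(df.loc[both_mask, "unit"])
```

Because each ward is its own column, the order and capitalization of the names inside the original string no longer matter for this kind of query.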

Upvotes: 3
