How to remove duplicate entries within a column row in pandas?

Question

I have this dataframe

I want to remove the duplicate entries within each cell of a column and keep only the unique values. I tried set(df['unit']) but it didn't help at all. For example, I want to end up with only (New Ward, Adult Ward)/(Pediatric Ward). Could you please show me how to fix this?

fsimonjetz · Accepted Answer

Looks like your cells contain comma-separated strings. In that case, you can

split the entries at ',\s*' (comma plus optional whitespace),
convert them to sets,
and rejoin them to comma-delimited strings:

df['unit'] = df['unit'].str.split(r',\s*').apply(set).str.join(', ')
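
As a minimal, self-contained sketch of the same split/deduplicate/rejoin idea: the sample frame and the helper name dedupe_cell below are made up for illustration, since the question's actual data is not shown. This variant swaps set for dict.fromkeys, because a set does not guarantee the order of the rejoined entries, while dict.fromkeys keeps the first-seen order.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's dataframe.
df = pd.DataFrame({"unit": [
    "Pediatric Ward, Pediatric Ward",
    "New Ward, Adult Ward, New Ward",
    "Adult Ward",
]})

def dedupe_cell(cell: str) -> str:
    """Drop repeated entries in one comma-separated cell, keeping first-seen order."""
    parts = [p.strip() for p in cell.split(",")]
    # dict.fromkeys removes duplicates while preserving insertion order.
    return ", ".join(dict.fromkeys(p for p in parts if p))

df["unit"] = df["unit"].apply(dedupe_cell)
print(df)
```

Note that this comparison is still case-sensitive, so 'New Ward' and 'New WARD' would both be kept.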

However, there are downsides to this way of encoding values. (For example, 'Adult Ward, New WARD' and 'New WARD, Adult Ward' would be two distinct values when in reality, they are identical). Consider handling each ward as a separate column, e.g.,

for w in ['Pediatric Ward', 'New Ward', 'Adult Ward']:
    df[w] = df.unit.str.contains(w, case=False)

will result in separate columns for each ward, which will be much easier to work with down the line, like getting combinations of wards:

                   unit  Pediatric Ward  New Ward  Adult Ward
0        Pediatric Ward            True     False       False
1  Adult Ward, New WARD           False      True        True
2              New WARD           False      True       False
3        Pediatric Ward            True     False       False
4  Adult Ward, New WARD           False      True        True
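
To illustrate what "easier to work with" might look like, here is a rough sketch that rebuilds the frame from the output above and then queries it. The variable names wards, both, and combo_counts are only illustrative.

```python
import pandas as pd

# Rebuild the 'unit' column shown in the table above.
df = pd.DataFrame({"unit": [
    "Pediatric Ward", "Adult Ward, New WARD", "New WARD",
    "Pediatric Ward", "Adult Ward, New WARD",
]})

wards = ["Pediatric Ward", "New Ward", "Adult Ward"]
for w in wards:
    # One boolean column per ward; case=False so 'New WARD' matches 'New Ward'.
    df[w] = df["unit"].str.contains(w, case=False)

# Rows that belong to both the New Ward and the Adult Ward.
both = df[df["New Ward"] & df["Adult Ward"]]

# How often each combination of wards occurs.
combo_counts = df.groupby(wards).size()

print(both)
print(combo_counts)
```

With boolean columns, combinations become plain logical expressions (&, |, ~) instead of string comparisons that depend on the order or casing of the entries.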
