mickeywise

Reputation: 67

How do I clean duplicate data in cells in pandas?

I have a DataFrame where the gender column contains duplicate values within individual cells. Here is an example:

1. Male
2. Female, female
3. Female, female , Female, female 

Upvotes: 1

Views: 88

Answers (2)

Frenchy

Reputation: 17027

You can just keep the first element of the split:

df['gender'] = df['gender'].apply(lambda x: x.split(',')[0])

If Male and Female appear in the same cell, it's your choice: you can drop the row, decide that the first gender is correct (my solution), or set another value so you can identify the row later. But that wasn't part of your original question.
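The third option (setting another value for mixed cells) could be sketched roughly as below. The function name resolve_gender and the "mixed" label are hypothetical, chosen only for illustration, and the sketch assumes every cell is a plain string.

```python
def resolve_gender(cell, mixed_label="mixed"):
    """Return the single gender found in a cell, or mixed_label otherwise.

    resolve_gender and mixed_label are illustrative names, not part of pandas.
    """
    # Split on commas, trim whitespace and lowercase each piece so that
    # "Female" and "female " compare equal; skip empty pieces.
    values = {part.strip().lower() for part in cell.split(",") if part.strip()}
    if len(values) == 1:
        return next(iter(values))
    # More than one distinct value (or an empty cell): flag it for later review.
    return mixed_label
```

Assuming the column is named gender as in the question, it could be applied with df['gender'].apply(resolve_gender).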

Upvotes: 1

jezrael

Reputation: 863166

Convert values to lowercase, then split, convert to sets and join back if necessary:

df['new'] = df['col'].apply(lambda x: ', '.join(set(x.lower().split(', '))))
print (df)
                                col     new
1.0                            Male    male
2.0                  Female, female  female
3.0  Female, female, Female, female  female
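Note that splitting on ', ' assumes exactly one space after each comma, while the question's sample also has a cell like "Female, female , Female, female " with stray spaces. A more tolerant variant might look like the sketch below; dedupe_cell is a hypothetical helper name, and dict.fromkeys is used instead of set only to keep the values in first-seen order.

```python
def dedupe_cell(cell, sep=", "):
    """Lowercase, trim and de-duplicate the comma-separated values in a cell.

    dedupe_cell is an illustrative helper, not a pandas function.
    """
    # Split on the bare comma and strip each piece, so irregular spacing
    # around the commas does not produce distinct values.
    parts = (part.strip().lower() for part in cell.split(","))
    # dict.fromkeys drops duplicates while preserving first-seen order.
    unique = dict.fromkeys(part for part in parts if part)
    return sep.join(unique)
```

It can be used the same way as the lambda above, for example df['new'] = df['col'].apply(dedupe_cell).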

To remove rows whose cells contain more than one distinct value, join the unique values with & and keep only the rows where no & appears:

print (df)
                              col
1.0                          Male
2.0                Female, female
3.0  Female, male, Female, female

df['new'] = df['col'].apply(lambda x: '&'.join(sorted(set(x.lower().split(', ')))))
print (df)
                              col          new
1.0                          Male         male
2.0                Female, female       female
3.0  Female, male, Female, female  female&male

df = df[df['new'].str.count('&') == 0]
print (df)
                col     new
1.0            Male    male
2.0  Female, female  female
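An alternative to counting & characters is to test each cell directly for a single distinct value and use the result as a boolean mask. This is only a sketch: has_single_value is a hypothetical helper, and the plain list below stands in for the col column.

```python
def has_single_value(cell):
    """True when a cell holds exactly one distinct value, ignoring case and spacing."""
    values = {part.strip().lower() for part in cell.split(",") if part.strip()}
    return len(values) == 1

# Stand-in for the 'col' column from the example above.
rows = ["Male", "Female, female", "Female, male, Female, female"]
kept = [cell for cell in rows if has_single_value(cell)]
print(kept)  # ['Male', 'Female, female']
```

With a DataFrame, the equivalent filter would be df = df[df['col'].apply(has_single_value)].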

Upvotes: 3
