Reputation: 67
I have a data frame where the gender column contains duplicate values within its cells. Here is an example:
1. Male
2. Female, female
3. Female, female , Female, female
Upvotes: 1
Views: 88
Reputation: 17027
Just keep the first element of the split:
df['gender'] = df['gender'].apply(lambda x: x.split(',')[0])
If Male and Female appear in the same cell, the choice is yours: drop the row, keep the first gender (my solution), or set a placeholder value so you can identify those rows later. That goes beyond what you originally asked, though.
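The "set another value" option could look roughly like the sketch below. The helper name `first_or_flag`, the `"mixed"` placeholder, and the `gender_clean` column are made up for illustration, and unlike the one-liner above this version also lowercases and strips whitespace before comparing.

```python
import pandas as pd

def first_or_flag(cell, flag="mixed"):
    """Return the single distinct value in a cell, or `flag` if there are several.

    `flag` is a hypothetical placeholder value, not something from the question.
    """
    # Normalise each comma-separated part: trim whitespace and lowercase it.
    parts = {p.strip().lower() for p in cell.split(",") if p.strip()}
    # Exactly one distinct value: keep it. Otherwise mark the cell for review.
    return parts.pop() if len(parts) == 1 else flag

df = pd.DataFrame({"gender": ["Male", "Female, female", "Female, male, Female"]})
df["gender_clean"] = df["gender"].apply(first_or_flag)
print(df)
```

Rows flagged this way can be inspected or dropped later with `df[df["gender_clean"] != "mixed"]`.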
Upvotes: 1
Reputation: 863166
Convert the values to lowercase, split them, convert to sets, and join back if necessary:
df['new'] = df['col'].apply(lambda x: ', '.join(set(x.lower().split(', '))))
print(df)
                                col     new
1.0                            Male    male
2.0                  Female, female  female
3.0  Female, female, Female, female  female
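One caveat: the sample in the question has a stray space before a comma in row 3 (`female ,`), and splitting on `', '` leaves that part as `'female '`, which the set treats as a different value. A possible variant, assuming you want whitespace ignored, is to split on `','` and strip each part. The helper name `dedupe` is just for illustration, and `sorted` is used only to make the joined order predictable.

```python
import pandas as pd

# Row 3 copied from the question, including the space before the comma.
raw = "Female, female , Female, female"
print(set(raw.lower().split(", ")))  # two elements: 'female' and 'female ' (trailing space)

def dedupe(cell):
    # Split on the bare comma, then trim and lowercase every part.
    parts = {p.strip().lower() for p in cell.split(",")}
    # Sort so the joined string does not depend on set ordering.
    return ", ".join(sorted(parts))

df = pd.DataFrame({"col": ["Male", "Female, female", raw]})
df["new"] = df["col"].apply(dedupe)
print(df)
```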
To remove rows that still hold more than one distinct value per cell, join the set with & instead; any result containing & had multiple values:
print(df)
                              col
1.0                          Male
2.0                Female, female
3.0  Female, male, Female, female
df['new'] = df['col'].apply(lambda x: '&'.join(set(x.lower().split(', '))))
print(df)
                              col          new
1.0                          Male         male
2.0                Female, female       female
3.0  Female, male, Female, female  female&male
df = df[df['new'].str.count('&') == 0]
print(df)
                col     new
1.0            Male    male
2.0  Female, female  female
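If you would rather not depend on a separator character, a similar filter can be sketched by counting the distinct values directly. This is an alternative to the `&` trick rather than a replacement for it, and the names `n_distinct` and `single` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col": ["Male", "Female, female", "Female, male, Female, female"]})

# Number of distinct values per cell, ignoring case and surrounding whitespace.
n_distinct = df["col"].apply(
    lambda x: len({p.strip().lower() for p in x.split(",")})
)

# Keep only the rows whose cell reduces to a single value.
single = df[n_distinct == 1]
print(single)
```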
Upvotes: 3