Reputation: 170
I have a dataframe, df, with a column "x", and I have set col = "x". When I run the following to get the unique values:
list(df[col].dropna().unique())
I am getting back a list with duplicate values
['1.Project Viewer\n2.Project Lead',
'1․Project Viewer\n2․Project Lead',
'1.Project Viewer\n2.Project Member\n3.Program Lead']
Before finding the unique values, I converted the column to string type and stripped whitespace, like this:
df[col] = df[col].dropna().astype('str')
df[col] = df[col].str.strip()
Any particular reason for this? How can I get the unique values?
Upvotes: 1
Views: 595
Reputation: 181
The list you pasted does contain only unique values. Character encoding is working against you here.
The first entry uses the "." character (an ordinary ASCII full stop), while the second entry uses the visually similar but technically distinct "․" character. The eye sees these records as duplicates, but the bits in the strings don't match, so each one is unique.
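To see the difference for yourself, something along these lines may help. It is only a sketch: it assumes the look-alike dot in your second entry is U+2024 (ONE DOT LEADER), which is what it appears to be in the pasted output, and describe_non_ascii is a hypothetical helper name, not anything from pandas.

```python
import unicodedata

# Assumption: the look-alike dot in the second entry is U+2024 (ONE DOT LEADER).
# Written here as an escape so the difference is visible in the source.
a = "1.Project Viewer\n2.Project Lead"
b = "1\u2024Project Viewer\n2\u2024Project Lead"


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


print(a == b)                 # the two strings compare as different
print(describe_non_ascii(a))  # no non-ASCII characters in the first entry
print(describe_non_ascii(b))  # reports the look-alike dot in the second entry
```

Running the helper over your real values should tell you exactly which characters you are dealing with, rather than relying on how they render.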
I'd recommend searching your data for entries containing characters outside the standard ASCII range. Then you can decide whether to clean those records up manually (if there are relatively few of them) or programmatically (for example, by replacing every non-ASCII character with an underscore, or by mapping each one to an appropriate ASCII substitute).
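As a rough illustration of both steps in pandas, here is a minimal sketch. The sample dataframe is made up to mirror the question, the look-alike dot is again assumed to be U+2024, and Unicode NFKC normalization is just one possible substitution strategy, not the only one. NFKC happens to map U+2024 to a plain ".", but it also rewrites other compatibility characters (full-width letters, ligatures and so on), so check that this is acceptable for your data.

```python
import pandas as pd

# Made-up sample data mirroring the question; "\u2024" is the assumed look-alike dot.
df = pd.DataFrame({"x": [
    "1.Project Viewer\n2.Project Lead",
    "1\u2024Project Viewer\n2\u2024Project Lead",
    "1.Project Viewer\n2.Project Member\n3.Program Lead",
    None,
]})
col = "x"

values = df[col].dropna().astype(str).str.strip()

# Step 1: flag entries containing any character outside the ASCII range.
suspect = values[values.str.contains(r"[^\x00-\x7F]", regex=True)]
print(suspect)

# Step 2: one possible programmatic cleanup, Unicode NFKC normalization,
# which folds compatibility characters into their plain equivalents.
cleaned = values.str.normalize("NFKC")
unique_values = list(cleaned.unique())
print(unique_values)
```

If normalization is too aggressive for your data, an explicit mapping (for instance `values.str.replace("\u2024", ".", regex=False)`) keeps the change limited to the characters you have actually identified.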
Upvotes: 2