np.unique not returning unique values

Question

I have a DataFrame, df, with a column "x", and I set col = "x". When I run the following to get the unique values:

list(df[col].dropna().unique())

I get back a list that appears to contain duplicate values:

['1.Project Viewer
2.Project Lead',
 '1․Project Viewer
2․Project Lead',
 '1.Project Viewer
2.Project Member
3.Program Lead']

I converted the column to string type and stripped whitespace before finding the unique values, like this:

df[col] = df[col].dropna().astype('str')
df[col] = df[col].str.strip()

Any particular reason for this? How can I get the unique values?

Tom Darby · Accepted Answer

The list you pasted does contain only unique values. Character encoding is working against you here.

The first entry uses the ordinary ASCII full stop . (U+002E), while the second entry uses the visually similar but technically distinct ․ character (it appears to be U+2024, ONE DOT LEADER). The eye sees these records as duplicates, but the underlying code points in the strings don't match, so each string is unique.
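To confirm this on your own data, you can list the non-ASCII characters in a string along with their Unicode names. This is a minimal sketch using only the standard library; the helper name describe_non_ascii is made up for illustration, and the sample strings assume the look-alike character is U+2024.

```python
import unicodedata

# Assumption: the second string uses U+2024 (ONE DOT LEADER) in place of ".".
a = "1.Project Viewer"
b = "1\u2024Project Viewer"


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


print(a == b)                 # the two strings do not compare equal
print(describe_non_ascii(a))  # nothing to report for the pure-ASCII string
print(describe_non_ascii(b))  # reports the look-alike dot and its position
```

Running the helper over a few of the "duplicate" values should show exactly which characters differ.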

I'd recommend searching your data for entries containing characters outside the standard ASCII range. Then you can decide whether to clean those records up manually (if there are relatively few of them) or programmatically (for example, by replacing every non-ASCII character with an underscore, or by finding appropriate substitutions for those characters).
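One possible way to do both steps is sketched below: flag values that contain non-ASCII characters, then fold compatibility characters to their plain forms with Unicode NFKC normalization. NFKC is my suggestion rather than something the question used, the helper names has_non_ascii and normalize_text are hypothetical, and the sample values assume the look-alike dot is U+2024.

```python
import unicodedata


def has_non_ascii(text):
    """True if the string contains any character outside the ASCII range."""
    return not text.isascii()


def normalize_text(text):
    """Fold compatibility characters (such as U+2024) to plain forms, then strip."""
    return unicodedata.normalize("NFKC", text).strip()


# Sample values modeled on the question; U+2024 is assumed in the second one.
values = [
    "1.Project Viewer\n2.Project Lead",
    "1\u2024Project Viewer\n2\u2024Project Lead",
    "1.Project Viewer\n2.Project Member\n3.Program Lead",
]

# Step 1: find the entries worth inspecting.
suspects = [v for v in values if has_non_ascii(v)]

# Step 2: normalize, then deduplicate.
cleaned = sorted({normalize_text(v) for v in values})

print(len(suspects))
print(len(cleaned))
```

On a DataFrame, the same idea should work with df[col].map(normalize_text, na_action="ignore") or with pandas' built-in df[col].str.normalize("NFKC"), followed by .dropna().unique(). Note that NFKC only folds compatibility characters; look-alikes from other scripts (for example a Cyrillic letter that resembles a Latin one) would still need an explicit substitution table.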
