LauraF

Reputation: 365

Pandas dataframe replace em-dash with nan

I am trying to read a large number of .xls and .xlsx files with predominantly numeric data into Python using pd.read_excel. However, the files use an em-dash for missing values. I am trying to get Python to replace all these em-dashes with NaN. I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work:

df['var'].apply(lambda x: re.sub(u'\u2014', '', x))

I also tried simply

df['var'].astype('float')

What would be the best way to convert all the em-dashes in a dataframe to NaN, while keeping the numeric data as floats?

Upvotes: 1

Views: 3689

Answers (4)

DYZ

Reputation: 57033

You should catch the error at an earlier stage. Tell pd.read_excel() to treat em-dashes as NaNs:

df = pd.read_excel(..., na_values=['–','—'])
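Since the original .xls files aren't reproducible here, the same na_values behavior can be illustrated with an in-memory CSV; pd.read_excel accepts the argument in exactly the same way:

```python
import io
import pandas as pd

# Made-up data: one cell holds an em-dash (\u2014) instead of a number.
csv = io.StringIO('var\n1.5\n\u2014\n2.0\n')

# Listing both en-dash and em-dash covers files that mix the two.
df = pd.read_csv(csv, na_values=['\u2013', '\u2014'])

print(df['var'].dtype)  # float64: the em-dash row was read as NaN
```

Because the dashes never enter the dataframe as strings, the column comes back as float64 directly, with no post-processing pass needed.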

Upvotes: 5

LauraF

Reputation: 365

Not sure exactly what was going on with those dashes (which showed up as u'\u2013' when I did df.get_value(0,'var')), but I did find a solution that worked: it converts the dashes to NaN and keeps the numeric data as numbers.

import unicodedata

# Python 2: `unicode` is a builtin there (in Python 3, use str and decode the bytes)
df['var'] = df['var'].map(unicode)
df['var'] = df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore'))
df['var'] = pd.to_numeric(df['var'])
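For reference, a Python 3 sketch of the normalization step alone (the helper name is mine): encoding with errors='ignore' simply drops any character that has no ASCII form after NFKD normalization, which is why the dashes disappear.

```python
import unicodedata

def strip_non_ascii(text):
    """Drop characters with no ASCII form after NFKD normalization."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(strip_non_ascii('\u2013'))  # en-dash vanishes -> ''
print(strip_non_ascii('3.5'))     # plain digits pass through -> '3.5'
```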

Upvotes: 0

dejoma

Reputation: 504

df.replace({'-': None}) is what you are looking for (for this question, use the actual em-dash character as the key, not a plain hyphen). Found in another post on Stack Overflow.

Upvotes: -1

sacuL

Reputation: 51345

I think the most straightforward way to do this would be pd.to_numeric with the argument errors='coerce':

df['var'] = pd.to_numeric(df['var'], errors='coerce')
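A self-contained illustration with made-up values: anything that can't be parsed as a number, the em-dash included, comes back as NaN, and the result is a float column.

```python
import pandas as pd

# Strings as they might arrive from the spreadsheet, em-dash and all.
s = pd.Series(['1.5', '\u2014', '2'])

out = pd.to_numeric(s, errors='coerce')
print(out.tolist())  # [1.5, nan, 2.0]
```

The caveat is that coercion is indiscriminate: any genuinely malformed numeric entry is silently turned into NaN as well, not just the dashes.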

From the docs:

If ‘coerce’, then invalid parsing will be set as NaN

Upvotes: 1
