Reputation: 365
I am trying to read a large number of .xls and .xlsx files with predominantly numeric data into Python using pd.read_excel. However, the files use an em dash for missing values. I am trying to get pandas to replace all these em dashes with NaN. I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work (note the escape must be \u2014, not \2014):
df['var'].apply(lambda x: re.sub(u'\u2014', '', x))
I also tried simply
df['var'].astype('float')
What would be the best way to get all the em dashes in a dataframe converted to NaN, while keeping the numeric data as floats?
Upvotes: 1
Views: 3689
Reputation: 57033
You should catch the error at an earlier stage. Tell pd.read_excel()
to treat both en dashes and em dashes as NaN:
df = pd.read_excel(..., na_values=['–','—'])
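A self-contained sketch of the same idea (pd.read_csv shares the na_values option with pd.read_excel, so a small in-memory CSV stands in for the spreadsheet; the column name var is assumed):

```python
import io
import pandas as pd

# Made-up data standing in for the spreadsheet: one column with
# an en dash (U+2013) and an em dash (U+2014) marking missing values.
data = io.StringIO(u"var\n1.5\n\u2013\n2.5\n\u2014\n")

# na_values treats both dash characters as NaN at parse time,
# so the column comes back as float64 with no further conversion.
df = pd.read_csv(data, na_values=[u'\u2013', u'\u2014'])

print(df['var'].dtype)         # float64
print(int(df['var'].isna().sum()))  # 2
```

Because the dashes never enter the frame as strings, the column is numeric from the start.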
Upvotes: 5
Reputation: 365
Not sure exactly what was going on with those dashes (which showed up as u'\u2013' when I did df.get_value(0, 'var'), i.e. an en dash rather than an em dash), but I did find a solution that worked: it converted the dashes to NaN and kept the numeric data as numbers.
import unicodedata
import pandas as pd

# Coerce every value to unicode (Python 2; use str on Python 3).
df['var'] = df['var'].map(unicode)
# Strip the non-ASCII dashes, leaving empty strings where values were missing.
df['var'] = df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore'))
# errors='coerce' turns those now-empty strings into NaN instead of raising.
df['var'] = pd.to_numeric(df['var'], errors='coerce')
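For comparison, pd.to_numeric alone can do both the stripping and the conversion in one step: with errors='coerce', anything unparseable, dashes included, becomes NaN. A minimal sketch with made-up values:

```python
import pandas as pd

# errors='coerce' turns the unparseable en dash into NaN directly,
# so no unicode normalization pass is needed.
s = pd.Series([u'1.5', u'\u2013', u'2.5'])
out = pd.to_numeric(s, errors='coerce')
print(out.tolist())  # [1.5, nan, 2.5]
```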
Upvotes: 0
Reputation: 504
df.replace({u'\u2013': np.nan})
is what you are looking for; note that replace() matches whole cell values exactly, so the key must be the actual dash present in the data (an en dash here), not an ASCII hyphen. Found in another post on Stack Overflow.
Upvotes: -1