Remove non-ASCII characters from pandas dataframe

Question

I have read the existing posts on how to remove non-ASCII characters from a string in Python. My problem is that the same approach doesn't work when I apply it to a dataframe that I have read from a CSV file. Any idea why?

import pandas as pd
import numpy as np
import re
import string
import unicodedata

def preprocess(x):
    # Convert to unicode
    text = unicode(x, "utf8")           
    # Convert back to ascii
    x = unicodedata.normalize('NFKD',text).encode('ascii','ignore') 
    return x  

preprocess("Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG")

'Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG'

df = pd.DataFrame(["Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"])
df.columns=['text']
df['text'] = df['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
df['text'][0]

'Ludwig Maximilian University of Munich / Munchen (LMU) and Siemens AG'

df1 = pd.read_csv('sample.csv')
df1['text'] = df1['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
df1['text'][0]

'Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG'

Note that df1 is exactly like df.

Scratch'N'Purr · Accepted Answer

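It's because pandas reads the text in the file as a raw string: the sequence \xc3\xbc arrives as eight literal characters (backslash, x, c, 3, and so on) rather than as the two bytes those escapes denote. It's essentially equivalent to: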

df = pd.DataFrame({"text": [r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"]})
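
To make the difference concrete, here is a small sketch comparing the two literals. The variable names `escaped` and `literal` are only illustrative and do not appear in the question:

```python
# A normal string literal: the interpreter turns each \xNN escape into one character.
escaped = "M\xc3\xbcnchen"

# A raw string literal: backslash, 'x' and the hex digits stay as separate characters,
# which is what ends up in the dataframe when the file itself contains "\xc3\xbc".
literal = r"M\xc3\xbcnchen"

print(len(escaped))   # 8
print(len(literal))   # 14
print(literal[1:5])   # the four characters \xc3
```

The raw version is six characters longer, and none of those characters is non-ASCII, so there is nothing for the normalization step to strip.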

To get the normalization to work properly, you'd have to process the escaped string. Just modify your preprocess function:

def preprocess(x):
    decoded = x.decode('string_escape')
    text = unicode(decoded, 'utf8')
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

It should work afterwards:

>>> df = pd.DataFrame({"text": [r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"]})
>>> df
                                                text
0  Ludwig Maximilian University of Munich / M\xc3...
>>> df['text'] = df['text'].apply(lambda x: preprocess(x) if(pd.notnull(x)) else x)
>>> df
                                                text
0  Ludwig Maximilian University of Munich / Munch...
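
The code above targets Python 2; `unicode` and the `string_escape` codec are not available in Python 3. A rough Python 3 sketch of the same idea might look like the following. It assumes each cell is a `str` made only of ASCII characters, with the non-ASCII text written as literal `\xNN` escapes of UTF-8 bytes, as in the question. The name `preprocess_py3` is made up for this illustration:

```python
import unicodedata

def preprocess_py3(x):
    # Assumption: x is an ASCII-only str holding literal escapes such as \xc3\xbc.
    # 1. Interpret the escape sequences; each \xNN becomes one code point 0x00-0xFF.
    unescaped = x.encode("ascii").decode("unicode_escape")
    # 2. Map those code points back to raw bytes, then decode the bytes as UTF-8.
    text = unescaped.encode("latin-1").decode("utf-8")
    # 3. Decompose accented characters and drop whatever is not ASCII.
    stripped = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
    return stripped.decode("ascii")

sample = r"Ludwig Maximilian University of Munich / M\xc3\xbcnchen (LMU) and Siemens AG"
print(preprocess_py3(sample))
```

It can be applied to the column in the same way as before, with `df['text'].apply(lambda x: preprocess_py3(x) if pd.notnull(x) else x)`. Cells that already contain real non-ASCII characters would fail at the first step and need separate handling.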
