Joohun Lee
Joohun Lee

Reputation: 187

Remove non-ASCII characters from string columns in pandas

I have panda dataframe with multiple columns which mixed with values and unwanted characters.

columnA        columnB    columnC        ColumnD
\x00A\X00B     NULL       \x00C\x00D        123
\x00E\X00F     NULL       NULL              456

what I'd like to do is to make this dataframe as below.

columnA  columnB  columnC   ColumnD
AB        NULL       CD        123
EF        NULL       NULL      456

With my codes below, I can remove '\x00' from columnA but columnC is tricky as it is mixed with NULL in certain row.

col_names = cols_to_clean
fixer = dict.fromkeys([0x00], u'')
for i in col_names:
if df[i].isnull().any() == False:
    if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))

Is there any efficient way to remove unwanted characters from columnC?

Upvotes: 6

Views: 7201

Answers (2)

cs95
cs95

Reputation: 402463

In general, to remove non-ascii characters, use str.encode with errors='ignore':

df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')

To perform this on multiple string columns, use

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

Although that still won't handle the null characters in your columns. For that, you replace them using regex:

df2 = df.replace(r'\W+', '', regex=True)

Upvotes: 8

ilia timofeev
ilia timofeev

Reputation: 1119

What is the trick with NULL?
If you want to replace string 'NULL' with real NaN use replace:

df.replace('NULL',np.NaN, inplace=True)
print(df.isnull()) 

Out:

   columnA  columnB  columnC  columnD
0    False     True    False    False
1    False     True     True    False

Or you need replace 'NULL' with empty string, use RegEx in str.replace

df = df.apply(lambda col: col.str.replace(
               r"[\x00|NULL]", "") if col.dtype == object else col)

print (df.isnull())
print (df.values)

Out:


   columnA  columnB  columnC  columnD
0    False    False    False    False
1    False    False    False    False

[['AB' '' 'CD' 123]
 ['EF' '' '' 456]]

Upvotes: 0

Related Questions