Reputation: 187
I have a pandas DataFrame with multiple columns whose values are mixed with unwanted characters.
columnA     columnB  columnC     columnD
\x00A\x00B  NULL     \x00C\x00D  123
\x00E\x00F  NULL     NULL        456
What I'd like to do is make the DataFrame look like this:
columnA  columnB  columnC  columnD
AB       NULL     CD       123
EF       NULL     NULL     456
With my code below, I can remove '\x00' from columnA, but columnC is trickier because it is mixed with NULL in some rows.
col_names = cols_to_clean
fixer = dict.fromkeys([0x00], u'')
for i in col_names:
    if not df[i].isnull().any():
        if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))
Is there any efficient way to remove unwanted characters from columnC?
Upvotes: 6
Views: 7201
Reputation: 402463
In general, to remove non-ASCII characters, use str.encode with errors='ignore':
df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
To perform this on multiple string columns, use:
u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
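As a quick, self-contained check of the idea (the column name col and the sample strings here are made up for the demo):

```python
import pandas as pd

# Sample frame with accented (non-ASCII) characters.
df = pd.DataFrame({'col': ['café', 'naïve', 'plain']})

# Encoding to ASCII with errors='ignore' drops every non-ASCII
# character; decoding back gives a clean str column.
df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
print(df['col'].tolist())  # ['caf', 'nave', 'plain']
```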
Although that still won't handle the null characters in your columns, since \x00 is itself valid ASCII. For those, you can replace them using a regex:
df2 = df.replace(r'\W+', '', regex=True)
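A minimal sketch of that call on data shaped like the question's (the frame is rebuilt here just for the demo):

```python
import pandas as pd

df = pd.DataFrame({
    'columnA': ['\x00A\x00B', '\x00E\x00F'],
    'columnB': ['NULL', 'NULL'],
    'columnC': ['\x00C\x00D', 'NULL'],
    'columnD': [123, 456],
})

# \W matches any non-word character, which includes the NUL byte;
# the literal string 'NULL' and the numeric column are untouched.
df2 = df.replace(r'\W+', '', regex=True)
print(df2.loc[0, 'columnA'])  # AB
```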
Upvotes: 8
Reputation: 1119
What is the trick with NULL? If you want to replace the string 'NULL' with a real NaN, use replace:
df.replace('NULL', np.nan, inplace=True)
print(df.isnull())
Out:
  columnA columnB columnC columnD
0   False    True   False   False
1   False    True    True   False
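A runnable sketch of the NaN route, rebuilding the question's frame (column names taken from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'columnA': ['\x00A\x00B', '\x00E\x00F'],
    'columnB': ['NULL', 'NULL'],
    'columnC': ['\x00C\x00D', 'NULL'],
    'columnD': [123, 456],
})

# Exact-match replace: every cell equal to 'NULL' becomes a real NaN,
# so isnull() now reports those cells as missing.
df.replace('NULL', np.nan, inplace=True)
print(df.isnull().sum().tolist())  # [0, 2, 1, 0]
```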
Or, if you need to replace 'NULL' with an empty string, use a regex in str.replace:
df = df.apply(lambda col: col.str.replace(
    r"\x00|NULL", "", regex=True) if col.dtype == object else col)
print (df.isnull())
print (df.values)
Out:
  columnA columnB columnC columnD
0   False   False   False   False
1   False   False   False   False

[['AB' '' 'CD' 123]
 ['EF' '' '' 456]]
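End to end, the empty-string route looks like this on the question's data (the frame is rebuilt for the demo, and regex=True is passed explicitly for newer pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'columnA': ['\x00A\x00B', '\x00E\x00F'],
    'columnB': ['NULL', 'NULL'],
    'columnC': ['\x00C\x00D', 'NULL'],
    'columnD': [123, 456],
})

# Strip NUL bytes and the literal string 'NULL' from object columns
# only; the integer column passes through unchanged.
df = df.apply(lambda col: col.str.replace(r'\x00|NULL', '', regex=True)
              if col.dtype == object else col)
print(df.values.tolist())  # [['AB', '', 'CD', 123], ['EF', '', '', 456]]
```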
Upvotes: 0