Sam B.

Reputation: 3033

python unicode replace backslash u with an empty string

I'm sanitizing a pandas DataFrame and encountered a unicode string containing a backslash followed by a u, which I need to replace, e.g.

u'\u2014'.replace('\u','')
Result: u'\u2014'

I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.

pandas code

merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)

Error

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)

Upvotes: 4

Views: 3650

Answers (2)

gauravtolani

Reputation: 130

Yes, because inside a unicode literal '\u2014' is read as one escape sequence (a single Unicode character), not as the literal characters '\u' followed by '2014'.
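To make the difference concrete, here is a minimal sketch written for Python 3 (the variable names are only for illustration). It compares the escaped character with a raw string that really does contain a backslash and a u:

```python
# In a unicode literal, \u2014 is ONE character (an em dash), so the
# string contains no backslash and no "u" that replace() could find.
dash = u'\u2014'
length = len(dash)
code_point = hex(ord(dash))

# A raw string keeps the six literal characters: \ u 2 0 1 4
literal = r'\u2014'
literal_length = len(literal)
stripped = literal.replace('\\u', '')

print(length, code_point)        # 1 0x2014
print(literal_length, stripped)  # 6 2014
```

Only the second string has anything for replace('\\u', '') to match, which is why the call in the question returns the input unchanged.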

Things that can help:

  • Converting to ascii using .encode('ascii', 'ignore')
  • As you are using pandas, you can use 'encoding' parameter and pass 'ascii' there.
  • Do this instead : u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
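As a rough sketch of the first and third suggestions (to_ascii and replace_dash are hypothetical helper names, not pandas or standard-library functions), something like this should behave the same on Python 2 and 3:

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode('ascii', 'ignore').decode('ascii')


def replace_dash(text, replacement=u'-'):
    """Swap the em dash (U+2014) for an ASCII-safe replacement."""
    return text.replace(u'\u2014', replacement)


sample = u'1\u20142'
print(to_ascii(sample))      # 12
print(replace_dash(sample))  # 1-2
```

Note that 'ignore' silently discards every non-ASCII character, not just the em dash, so the targeted replace is probably safer if other characters in the column matter.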

Hope this helps.

Upvotes: 1

rawwar

Reputation: 5012

u'\u2014' is actually the em dash character (—). It's not a number; it's a single Unicode character. Try printing it and you will see.

This is the output in ipython:

In [4]: print("val = ", u'\u2014')
val =  —

Based on your comment, here is what you are doing wrong: "-" (a hyphen) is not the same as the em dash Unicode character (u'\u2014').

So, you should do the following

print(u'\u2014'.replace("\u2014",""))

and that will work
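If you want to check the hyphen versus em dash distinction yourself, a small sketch using the standard unicodedata module (variable names are illustrative) could look like this:

```python
import unicodedata

hyphen = u'-'
em_dash = u'\u2014'

# The two characters have different official Unicode names.
print(unicodedata.name(hyphen))   # HYPHEN-MINUS
print(unicodedata.name(em_dash))  # EM DASH

# Replacing the hyphen leaves the em dash untouched ...
unchanged = em_dash.replace(hyphen, u'')
# ... while replacing the em dash itself removes it.
removed = em_dash.replace(em_dash, u'')

print(len(unchanged), len(removed))  # 1 0
```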

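EDIT: since you are using Python 2.x, pass unicode literals (with the u prefix) to replace. A plain "\u2014" there is a byte string holding a literal backslash, and "-" is a different character, so neither matches. The utf-8 encode/decode round trip is not needed:

u'\u2014'.replace(u'\u2014', u'')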

Upvotes: 3
