user959129


Python, Unicode and parsing with lxml and how to deal with 35\xa0new

I am extracting a field from a web page, and the text content of the HTML tag looks like this...

35 new

In Python, the extracted data looks like this...

35\xa0new

How do I deal with Unicode in Python so that I can convert this to a regular string?
"35 new"

What library do I use?

Thanks

Upvotes: 0

Views: 679

Answers (2)

Ignacio Vazquez-Abrams

Reputation: 799580

Avoid working with regular (byte) strings whenever possible; unicode strings are generally more useful for text, and there are many well-known solutions for manipulating and dealing with them.
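As a rough illustration of that advice, here is a minimal sketch that keeps the value as unicode and only encodes it at the point of output. The variable names are made up for the example, and UTF-8 is an assumption about what the output needs.

```python
# The value as it comes back from the parser: a unicode string.
text = u"35\xa0new"

# Work with it as text: 6 characters, one of which is the non-breaking space.
char_count = len(text)

# Encode only at the boundary (writing to a file, socket, etc.).
# UTF-8 is an assumed target encoding here.
encoded = text.encode("utf-8")

# Decoding with the same encoding gives the original text back.
round_tripped = encoded.decode("utf-8")
```

In UTF-8 the non-breaking space takes two bytes, so `encoded` is one byte longer than `text` has characters.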

Upvotes: 3

Ned Batchelder

Reputation: 376082

You are getting unicode strings from the parser. You can replace certain characters if you prefer others. For example, your \xa0 is a non-breaking space, and you can replace it with a regular space:

text = text.replace(u"\xa0", u" ")

There could be many of these characters that you want to change, so it might be a long process of finding all the ones that occur in your data.
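If it does turn into a long list, one way to organize it is a translation table plus a small helper that reports which non-ASCII characters are still left in the data. This is only a sketch: the `REPLACEMENTS` mapping and the function names `clean_text` and `unexpected_chars` are hypothetical, and the characters listed are just common examples, not necessarily the ones in your pages.

```python
# Hypothetical mapping of characters to replace; extend it as you find more.
REPLACEMENTS = {
    u"\xa0": u" ",    # non-breaking space
    u"\u2019": u"'",  # right single quotation mark
    u"\u201c": u'"',  # left double quotation mark
    u"\u201d": u'"',  # right double quotation mark
    u"\u2013": u"-",  # en dash
}

# translate() on unicode strings expects a dict keyed by code point.
TABLE = dict((ord(src), dst) for src, dst in REPLACEMENTS.items())


def clean_text(text):
    """Replace every character listed in REPLACEMENTS in one pass."""
    return text.translate(TABLE)


def unexpected_chars(text):
    """Return the sorted non-ASCII characters still present in text."""
    return sorted(set(ch for ch in text if ord(ch) > 127))


cleaned = clean_text(u"35\xa0new")
leftover = unexpected_chars(clean_text(u"caf\xe9\xa035"))
```

Running `unexpected_chars` over the cleaned data shows what the table does not cover yet, so you can decide case by case whether to replace a character or keep it (an accented letter such as `\xe9` is usually worth keeping).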

Upvotes: 0
