Reputation: 4122
I am using Python 2.7 (64-bit, from Python.org) on 64-bit Vista to run Scrapy. I am trying to scrape some text from this webpage and have managed to clean up most of it, removing line breaks and HTML tags. However, tags still seem to be included in the text printed to the command shell:
u' British Grand Prix practice results ', u'
This is from the following webpage:
http://www.bbc.co.uk/sport/0/formula1/28166984
The above string represents a hyperlink to another page. I have tried using the following regular expression to remove the 'u' tags, but it has not worked:
body = response.xpath("//p").extract()
body2 = str(body)
body3 = re.sub(r'(\\[u]|\s){2,}', ' ', body2)
Can anyone suggest a way of removing these tags? Also, if possible, can regular expressions be used to remove everything between two tags as well?
Thanks
Upvotes: 1
Views: 4195
Reputation: 3599
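As furas mentioned, the u is not part of the text; it only marks the value as a unicode object. In Python 2.7, a plain string literal is a byte string (str) and the default encoding is ASCII, so the interpreter shows unicode strings with a u prefix to tell the two apart. You can convert back and forth with unicode() and encode('utf-8'):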
>>> a = 's'
>>> a
's'
>>> a = unicode('s')
>>> a
u's'
>>> a = a.encode('utf-8')
>>> a
's'
Here's how to do it with a list:
>>> ul = []
>>> ul.append(unicode('British Grand Prix practice results'))
>>> ul.append(unicode('some other string'))
>>> ul
[u'British Grand Prix practice results', u'some other string']
>>> l = []
>>> for s in ul:
...     l.append(s.encode('utf-8'))
...
>>> l
['British Grand Prix practice results', 'some other string']
>>>
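The loop above can also be written as a list comprehension. This is only a minimal sketch: the name encoded is made up for illustration, and the list contents are copied from the example above. Note that on Python 3 encode() returns bytes objects, which print with a b prefix, whereas on Python 2.7 the result is a list of plain str.

```python
# Same sample data as in the REPL session above.
ul = [u'British Grand Prix practice results', u'some other string']

# Encode every unicode string to UTF-8 bytes in one expression.
# "encoded" is a hypothetical name used only for this sketch.
encoded = [s.encode('utf-8') for s in ul]

print(encoded)
```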
Upvotes: 3
Reputation: 142681
The u is only Python's way of indicating that the text is a Unicode string.
You have to print the text in the correct way to get it without this information.
a = [ u'hello', u'world' ]
print a
[u'hello', u'world']
for x in a:
    print x
hello
world
In your situation, body is probably a list of strings. You can check with:
print type(body)
If it is, do this:
body2 = ''
for x in body:
    body2 += x
print body2
or even better:
body2 = "".join(body)
print body2
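For the second part of the question (removing tags, or everything between two tags, with regular expressions), here is a rough sketch that joins the list first and then cleans the resulting string. The sample body list, the function names clean_paragraphs and drop_element, and the choice of the a tag are all assumptions made for illustration; they are not from the original code. Regular expressions are fragile on nested or malformed HTML, so treat this as a starting point and not a general HTML cleaner.

```python
import re

# Hypothetical stand-in for response.xpath("//p").extract(),
# which returns a list of unicode strings.
body = [
    u'<p>\n British Grand Prix practice results \n</p>',
    u'<p>some <a href="#">other</a> text</p>',
]

def drop_element(text, tag):
    """Remove <tag ...>...</tag> together with everything between the two tags."""
    pattern = r'<%s\b[^>]*>.*?</%s>' % (tag, tag)
    return re.sub(pattern, u'', text, flags=re.DOTALL)

def clean_paragraphs(fragments):
    """Join the fragments, strip the tags (keeping their text), collapse whitespace."""
    text = u' '.join(fragments)            # join instead of str(list): no u'...' reprs
    text = re.sub(r'<[^>]+>', u' ', text)  # replace each tag with a space
    text = re.sub(r'\s+', u' ', text)      # collapse line breaks and repeated spaces
    return text.strip()

print(clean_paragraphs(body))
print(clean_paragraphs([drop_element(p, 'a') for p in body]))
```

Because the list is joined with " ".join() before cleaning, instead of being passed through str(), the u'...' markers never end up in the text in the first place.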
Upvotes: 2