Removing <u> character from text using Scrapy

Question

I am using Python.org version 2.7 64 bit on Vista 64 bit to run Scrapy. I am trying out scraping some text from this webpage and have managed to get most of the text cleaned up, removing line breaks and HTML tags. However, 'u' tags still seem to be included in the text output to the Command Shell:

u' British Grand Prix practice results ', u'

This is from the following webpage:

http://www.bbc.co.uk/sport/0/formula1/28166984

The string shown above represents a hyperlink to another page. I have tried using the following regular expression to remove the 'u' tags, but it has not worked:

body = response.xpath("//p").extract()
body2 = str(body)
body3 = re.sub(r'(\[u]|\s){2,}', ' ', body2)

Can anyone suggest a way of removing these tags? Also, if possible, can you use regular expressions to remove everything between two tags as well?

Thanks

furas · Accepted Answer

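The u is not a tag and not part of your text. It is only Python's way of showing that the string is a Unicode string. You see it because printing a list shows each element in its quoted form, prefix included.

You have to print the text itself, not the list, to get it without this information.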

a = [ u'hello', u'world' ]

print a

[u'hello', u'world']

for x in a:
    print x

hello
world
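
This is also the likely reason the str(body) call in the question did not help: calling str() on a list gives you the text of the list's printed form, brackets and quotes included (plus the u prefixes on Python 2), so the regular expression then sees those characters as real text. A minimal sketch, where items is a made-up list standing in for body:

```python
from __future__ import print_function  # so print() also works on Python 2.7

# Made-up stand-in for the list returned by .extract()
items = [u'hello', u'world']

# str() on a list returns the list's printed form: brackets, quotes
# and, on Python 2, the u prefixes all become part of the new string.
as_list_text = str(items)

# Joining the elements keeps only the strings themselves.
joined = u' '.join(items)

print(as_list_text)
print(joined)
```

The first print shows the list syntax, the second shows only the words.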

In your situation body is probably a list of strings. You can check with:

print type(body)

If it is a list, do this:

body2 = ''

for x in body:
    body2 += x

print body2

or even better:

body2 = "".join(body)

print body2
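
As for removing everything between two tags with a regular expression: it can be done on the joined string, though regular expressions are fragile on real HTML, and selecting only text nodes in the XPath itself (for example //p//text()) is probably the more robust route. A rough sketch under those caveats, where the html string is a made-up sample and not the actual BBC markup:

```python
from __future__ import print_function  # so print() also works on Python 2.7

import re

# Made-up sample standing in for "".join(body); not the real page content.
html = (u'<p>Some text before. '
        u'<a href="/other-page">British Grand Prix practice results</a> '
        u'Some text after.</p>')

# Remove each <a> element together with everything between its tags.
# .*? is non-greedy so every link is matched on its own;
# re.DOTALL lets the match run across line breaks.
without_links = re.sub(r'<a\b[^>]*>.*?</a>', u'', html, flags=re.DOTALL)

# Remove any remaining tags but keep the text between them.
text_only = re.sub(r'<[^>]+>', u'', without_links)

# Collapse runs of whitespace into single spaces.
clean = re.sub(r'\s+', u' ', text_only).strip()

print(clean)
```

Swap the a in the first pattern for whichever tag you want to drop. Nested tags of the same name will not be handled correctly by this pattern.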
