Reputation: 31548
I have this code:
site = hxs.select("//h1[@class='state']")
log.msg(str(site[0].extract()),level=log.ERROR)
The output is:
[scrapy] ERROR: <h1 class="state"><strong>
1</strong>
<span> job containing <strong>php</strong> in <strong>region</strong> paying <strong>$30-40k per year</strong></span>
</h1>
Is it possible to get only the text, without any HTML tags?
Upvotes: 22
Views: 42356
Reputation: 4085
In your XPath, //h1[@class='state'], you are selecting the h1 tag whose class attribute is state, which is why it returns everything inside the h1 element.
If you just want the text directly inside the h1 tag, use:
//h1[@class='state']/text()
If you want the text of the h1 tag as well as its child tags, use:
//h1[@class='state']//text()
So the difference is that /text() returns the text of the specific tag only, while //text() returns the text of that tag and all of its descendants.
Because the text in your h1 sits inside child tags (strong and span), the following code, which uses //text(), should work for you:
site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()
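To see the difference between direct text and descendant text on the h1 from the question, here is a minimal sketch that uses only the standard library. It is not Scrapy's API: xml.etree.ElementTree has no text() step, so the tail attributes and itertext() are used as rough analogues of /text() and //text(), and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <h1> from the question (it happens to be well-formed XML).
html = (
    '<h1 class="state"><strong>\n1</strong>\n'
    '<span> job containing <strong>php</strong> in <strong>region</strong>'
    ' paying <strong>$30-40k per year</strong></span>\n</h1>'
)
h1 = ET.fromstring(html)

# Rough analogue of /text(): only text nodes that are direct children of <h1>.
direct_text = (h1.text or "") + "".join(child.tail or "" for child in h1)

# Rough analogue of //text(): text of <h1> and of all its descendants.
all_text = "".join(h1.itertext())

print(repr(direct_text.strip()))   # '' (only whitespace sits directly in <h1>)
print(" ".join(all_text.split()))  # 1 job containing php in region paying $30-40k per year
```

For this particular markup the direct text is only whitespace, so the descendant form is the one that returns the visible sentence.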
Upvotes: 60
Reputation: 19146
You can use BeautifulSoup's get_text() method.
from bs4 import BeautifulSoup
text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())
Upvotes: 3
Reputation: 19146
You can use html2text:
import html2text
converter = html2text.HTML2Text()
print(converter.handle("<div>Please!!!<span>remove me</span></div>"))
Upvotes: 0
Reputation: 383
You can use BeautifulSoup to strip HTML tags. Here is an example:
from bs4 import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract()), "html.parser").find_all(string=True))
You can then strip any extra whitespace, newlines, etc.
If you don't want to use additional modules, you can try a simple regex:
import re
# replace HTML tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
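If you want to stay within the standard library but would rather not rely on a regex, one option is html.parser. The sketch below is only an illustration: TextExtractor and strip_tags are hypothetical names, and the sample string is a shortened version of the h1 from the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes, then collapse leftover runs of whitespace.
    return " ".join("".join(parser.chunks).split())


sample = (
    '<h1 class="state"><strong>\n1</strong>\n'
    '<span> job containing <strong>php</strong> in <strong>region</strong></span>\n</h1>'
)
print(strip_tags(sample))  # 1 job containing php in region
```

Unlike the regex, this lets the parser handle attributes and character references such as &amp;amp;, at the cost of a few more lines.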
Upvotes: 2