Reputation: 1337
I'm using Scrapy to extract data from a web page, and I ran into the problem described below.
<li>
<a href="NEW-IMAGE?type=GENE&object=EG10567">
<b>
man
</b>
X -
<i>
Escherichia coli
</i>
</a>
<br>
</li>
That is the markup for one record's name on the page.
I want to get the full text content of the <a> tag (e.g. man X - Escherichia coli) without the tags themselves. Here is my code:
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]')
    base_url = "http://www.metacyc.org/META"
    for site in sites:
        item = MetaCyc()
        name_tmp = map(unicode.strip, site.xpath('text()').extract())
        item['Name'] = unicode(name_tmp).encode('utf-8')
        item['Link'] = map(unicode.strip, site.xpath('@href').extract())
        yield item
I tried converting the unicode to UTF-8, but the result still looks like this:
{"Link": ["NEW-IMAGE?type=GENE&object=EG10567"], "Name": "[u'X -']"}
Sometimes characters are missing from the records. How can I get the complete, correctly formatted text from the HTML?
Upvotes: 0
Views: 1242
Reputation: 487
"I want to get the content (e.g.:man X-Escherichia coli) in the <a> tag and don't want to get other tags."
Part of the problem is that the text is not all directly inside the <a> tag. Some of it is nested in the <b> and <i> tags underneath the <a> tag. To get the full link text as a string:
item_name = " ".join([word.strip() for word in sel.xpath('//li/a[contains(@href,"NEW-IMAGE")]//text()').extract() if len(word.strip())])
# => item_name = 'man X - Escherichia coli'
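The join is what turns the list of fragments into one string. In the question, unicode(name_tmp) converts the list itself to text, which gives the list's printed form, brackets and quotes included; that is why "Name" came out as "[u'X -']". A minimal sketch of the difference, written for Python 3 (where str plays the role of unicode) and using a made-up fragments list:

```python
# Hypothetical list of already-stripped text fragments, as extract() might return.
fragments = ["man", "X -", "Escherichia coli"]

# Converting the list itself to a string yields its repr, not the text.
as_repr = str(fragments)

# Joining the elements yields the text you actually want.
as_text = " ".join(fragments)

print(as_repr)  # ['man', 'X -', 'Escherichia coli']
print(as_text)  # man X - Escherichia coli
```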
The //a//text() expression recursively selects all text under all the <a> tags in the document, including text inside their child elements. Your sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]/text()').extract() would get "Some text" from:
<a href="../">Some text</a>
but would omit "And some more here" inside the <b> tag:
<a href="../">Some text<b>And some more here</b></a>
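The same distinction can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (it is not Scrapy's selector API, and direct_text is a helper name made up for this example): direct_text roughly mirrors what text() selects, and itertext() roughly mirrors .//text().

```python
import xml.etree.ElementTree as ET


def direct_text(elem):
    """Text that is a direct child of elem, roughly what XPath text() selects.

    In ElementTree, direct text lives in elem.text and in each child's .tail.
    """
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]


link = ET.fromstring('<a href="../">Some text<b>And some more here</b></a>')

direct = direct_text(link)         # only the text directly inside <a>
recursive = list(link.itertext())  # text of <a> and all of its descendants

print(direct)     # ['Some text']
print(recursive)  # ['Some text', 'And some more here']
```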
Upvotes: 0
Reputation: 20748
I suggest you use XPath's normalize-space() function. The XPath specification describes it as follows:
The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are the same as those allowed by the S production in XML. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
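In plain Python terms, the normalization described above amounts to roughly the following. This is only a sketch of the behaviour, not how XPath engines implement it; the function name normalize_space is chosen here for illustration, and whitespace is taken to be space, tab, carriage return and line feed, as in the XML S production.

```python
import re


def normalize_space(value):
    """Collapse runs of XML whitespace to one space and trim both ends."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    return collapsed.strip(" ")


# The string-value of the <a> element from the question: all descendant
# text concatenated, newlines included.
raw = "\n\nman\n\nX -\n\nEscherichia coli\n\n"
print(normalize_space(raw))  # man X - Escherichia coli
```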
>>> html = """<li>
... <a href="NEW-IMAGE?type=GENE&object=EG10567">
... <b>
... man
... </b>
... X -
... <i>
... Escherichia coli
... </i>
... </a>
... <br>
... </li>"""
>>> import scrapy
>>> selector = scrapy.Selector(text=html)
>>>
>>> links = selector.xpath('//li/a[contains(@href,"NEW-IMAGE")]')
>>> for link in links:
...     item = {}
...     item['Name'] = link.xpath('normalize-space(.)').extract_first()
...     item['Link'] = link.xpath('@href').extract_first()
...     print(item)
...
{'Link': u'NEW-IMAGE?type=GENE&object=EG10567', 'Name': u'man X - Escherichia coli'}
>>>
Upvotes: 1
Reputation: 3396
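If you want the text of an <a> tag and its children, you need to use .//text() instead of text(). The leading dot keeps the query relative to the current node; a bare //text() would select every text node in the document.
Try this:
name_tmp = map(unicode.strip, site.xpath('.//text()').extract())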
You can also use another module, html2text, to get only the text of a particular tag.
import html2text
htmlconverter = html2text.HTML2Text()
print htmlconverter.handle(''.join(name_tmp))
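If installing html2text is not an option, a similar text-only extraction can be sketched with the standard library's html.parser (Python 3). This is a swapped-in technique, not html2text's API; the class name AnchorText and the helper anchor_text are invented for this example, and the sketch merges the text of all <a> tags it sees into one string.

```python
from html.parser import HTMLParser

SAMPLE = """<li>
<a href="NEW-IMAGE?type=GENE&object=EG10567">
<b>
man
</b>
X -
<i>
Escherichia coli
</i>
</a>
<br>
</li>"""


class AnchorText(HTMLParser):
    """Collect text found inside <a> elements, at any nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <a> elements are currently open
        self.chunks = []  # raw text pieces seen while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def anchor_text(html):
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    # split() + join collapses the newlines between the pieces.
    return " ".join(" ".join(parser.chunks).split())


print(anchor_text(SAMPLE))  # man X - Escherichia coli
```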
Upvotes: 0