Coding_Rabbit

Reputation: 1337

How to get full link text with Scrapy

I'm using Scrapy to get data from a web page, and I've run into the problem below.

<li>
<a href="NEW-IMAGE?type=GENE&amp;object=EG10567">
<b>
man
</b>
X -
<i>
Escherichia coli
</i>
</a>
<br>
</li>

On the web page, the record's name looks like this: [image]

I want to get the content (e.g.:man X-Escherichia coli) in the <a> tag and don't want to get other tags. Here is my code:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]')
    base_url = "http://www.metacyc.org/META"
    for site in sites:
        item = MetaCyc()
        name_tmp = map(unicode.strip, site.xpath('text()').extract())
        item['Name'] = unicode(name_tmp).encode('utf-8')
        item['Link'] = map(unicode.strip, site.xpath('@href').extract())
        yield item

I have tried to convert the unicode to UTF-8, but the result still looks like this:

{"Link": ["NEW-IMAGE?type=GENE&object=EG10567"], "Name": "[u'X -']"} 

Sometimes characters are missing from the records. So I want to know how to get the complete, correctly formatted data from the HTML.

Upvotes: 0

Views: 1242

Answers (3)

nathanl93

Reputation: 487

I want to get the content (e.g.:man X-Escherichia coli) in the <a> tag and don't want to get other tags.

Part of the problem is that the text is not all contained directly in the <a> tag. Some of it is nested in the <b> and <i> tags underneath the <a> tag. To get the full link text as a string:

item_name = " ".join([word.strip() for word in sel.xpath('//li/a[contains(@href,"NEW-IMAGE")]//text()').extract() if len(word.strip())])  
# => item_name = 'man X - Escherichia coli'

The //text() step recursively selects every text node under the matched <a> tags, including text inside their child elements. Your sel.xpath('//ul/li/a[contains(@href,"NEW-IMAGE")]/text()').extract() selects only the text nodes that are direct children of <a>, so it would get "Some text" from:

<a href="../">Some text</a>

But it would omit "And some more here" inside the <b> tags in:

<a href="../">Some text<b>And some more here</b></a> 
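The same distinction can be reproduced without Scrapy. Below is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for the selector (an assumption made only so the snippet runs on its own; the variable names are illustrative). ElementTree has no text() step, so the direct child text is rebuilt from .text and the children's .tail, while itertext() walks every descendant, roughly like //text().

```python
import xml.etree.ElementTree as ET

# Same markup as the second example above.
a = ET.fromstring('<a href="../">Some text<b>And some more here</b></a>')

# Rough analogue of a/text(): only text nodes that are direct children of <a>.
# In ElementTree these are a.text plus the .tail of each child element.
direct_text = [t for t in [a.text] + [child.tail for child in a] if t and t.strip()]

# Rough analogue of a//text(): every text node under <a>, at any depth.
all_text = [t for t in a.itertext() if t.strip()]

print(direct_text)
print(all_text)
```

Here direct_text holds only the text that sits directly inside <a>, while all_text also picks up the text inside <b>.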

Upvotes: 0

paul trmbrth

Reputation: 20748

I suggest you use XPath's normalize-space(). From the XPath specification:

The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are the same as those allowed by the S production in XML. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.

>>> html = """<li>
... <a href="NEW-IMAGE?type=GENE&amp;object=EG10567">
... <b>
... man
... </b>
... X -
... <i>
... Escherichia coli
... </i>
... </a>
... <br>
... </li>"""
>>> import scrapy
>>> selector = scrapy.Selector(text=html)

>>>
>>> links = selector.xpath('//li/a[contains(@href,"NEW-IMAGE")]')
>>> for link in links:
...     item = {}
...     item['Name'] = link.xpath('normalize-space(.)').extract_first()
...     item['Link'] = link.xpath('@href').extract_first()
...     print(item)
... 
{'Link': u'NEW-IMAGE?type=GENE&object=EG10567', 'Name': u'man X - Escherichia coli'}
>>> 
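For intuition about what normalize-space(.) computes, here is a rough plain-Python equivalent. It is only a sketch: it uses the standard-library xml.etree.ElementTree instead of Scrapy, the helper name normalize_space is made up for illustration, and str.split() treats a few more characters as whitespace than the XML S production does (which makes no difference for this input).

```python
import xml.etree.ElementTree as ET

# The <a> element from the question (it is well-formed XML on its own).
fragment = """<a href="NEW-IMAGE?type=GENE&amp;object=EG10567">
<b>
man
</b>
X -
<i>
Escherichia coli
</i>
</a>"""


def normalize_space(value):
    # Hypothetical helper: strip both ends and collapse whitespace runs to one space.
    return " ".join(value.split())


link = ET.fromstring(fragment)

# The string-value of an element is the concatenation of all descendant text nodes.
string_value = "".join(link.itertext())

item = {"Name": normalize_space(string_value), "Link": link.get("href")}
print(item)
```

In a real spider the XPath version is still preferable, since the selector does the same work in a single expression.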

Upvotes: 1

Rahul

Reputation: 3396

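If you want to get the text of a tag and of its children, you need to use .//text() instead of text(). The leading dot keeps the query relative to the current node; a bare //text() called on site would select every text node in the whole document.

Try this:

name_tmp = map(unicode.strip, site.xpath('.//text()').extract())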

You can also use another module, html2text, to get only the text of a particular tag.

import html2text
htmlconverter = html2text.HTML2Text()
print htmlconverter.handle(''.join(name_tmp))

Upvotes: 0
