kuixiong

Reputation: 523

How to get innerHTML of a node using scrapy Selector?

Suppose there are some HTML fragments like:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

I want to extract the text inside each <a> tag, dropping the child tags but keeping their text. For the fragments above, the result should be "text in a text in b text in c" and "text in b text in a text in c". I can already select the nodes with the Scrapy Selector css() function, so how do I process these nodes to get what I want? Any ideas would be appreciated, thank you!

Upvotes: 11

Views: 13065

Answers (4)

Awais Asghar

Reputation: 39

Try this:

response.xpath('//a/node()').extract()

Upvotes: 3

Mario7

Reputation: 103

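In Scrapy 1.5, you can use /* to select the child elements of a node. Note that text nodes directly inside the node are not matched, and extract_first() returns only the first child element. Example: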

content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()

Upvotes: 0

paul trmbrth

Reputation: 20748

You can use XPath's string() function on the elements you select:

$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...    text in a
...    <b>text in b</b>
...    <c>text in c</c>
... </a>
... <a>
...    <b>text in b</b>
...    text in a
...    <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
... 
[u'\n   text in a\n   text in b\n   text in c\n']
[u'\n   text in b\n   text in a\n   text in c\n']
>>> 

Upvotes: 11

Cristian Lupascu

Reputation: 40506

Here's what I managed to do:

from scrapy.selector import Selector

sel = Selector(text=html_string)

# Print every text node found under an <a> element, one per line
for node in sel.css('a *::text'):
    print(node.extract())

Assuming that html_string is a variable holding the HTML from your question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of a elements.

Upvotes: 12
