Reputation: 523
Suppose there are some HTML fragments like:
<a>
text in a
<b>text in b</b>
<c>text in c</c>
</a>
<a>
<b>text in b</b>
text in a
<c>text in c</c>
</a>
I want to extract the text inside each a element, dropping the child tags but keeping their text. For the fragments above, the content I want would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using Scrapy's Selector css() function, so how should I process these nodes to get what I want? Any ideas would be appreciated, thank you!
Upvotes: 11
Views: 13065
Reputation: 103
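In Scrapy 1.5, you can append /* to the XPath to select an element's children, which gets you close to its inner HTML. Note that this returns the child elements as markup rather than plain text, skips any text sitting directly inside the parent, and extract_first() keeps only the first child.
Example: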
content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()
Upvotes: 0
Reputation: 20748
You can use XPath's string() function on the elements you select:
$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
... text in a
... <b>text in b</b>
... <c>text in c</c>
... </a>
... <a>
... <b>text in b</b>
... text in a
... <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
...
['\n text in a\n text in b\n text in c\n']
['\n text in b\n text in a\n text in c\n']
>>>
Upvotes: 11
Reputation: 40506
Here's what I managed to do:
from scrapy.selector import Selector

sel = Selector(text=html_string)
# Print every text node found under the a elements
for node in sel.css('a *::text'):
    print(node.extract())
Assuming that html_string is a variable holding the HTML in your question, this code produces the following output:
text in a
text in b
text in c
text in b
text in a
text in c
The selector a *::text matches all the text nodes which are descendants of a nodes.
Upvotes: 12