Reputation: 105
I'm using scrapy to process documents like this one:
...
<div class="contents">
some text
<ol>
<li>
more text
</li>
...
</ol>
</div>
...
I want to collect all the text inside the contents area into a string.
I also need the '1., 2., 3....' from the <li> elements, so my result should be 'some text 1. more text...'
So I'm looping over the children of <div class="contents">:
for n in response.xpath('//div[@class="contents"]/node()'):
    if n.xpath('self::ol'):
        result += process_list(n)
    else:
        result += n.extract()
If n is an ordered list, I loop over its elements and add a number to li/text() (in process_list()). If n is a text node itself, I just read its value.
However, 'some text' doesn't seem to be part of the node set, since the loop doesn't get inside the else part. My result is '1. more text'
Finding text nodes relative to their parent node works:
response.xpath('//div[@class="contents"]//text()')
finds all the text, but this way I can't add the list item numbers.
What am I doing wrong and is there a better way to achieve my task?
Upvotes: 1
Views: 713
Reputation: 20748
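Scrapy's Selectors use lxml under the hood, and lxml hands text nodes back as plain strings, which don't support further XPath calls. Calling .xpath() on a text-node Selector therefore just returns an empty list: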
>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="contents">
... some text
... <ol>
... <li>
... more text
... </li>
... ...
... </ol>
... </div>''')
>>> s.xpath('.//div[@class="contents"]/node()')
[<Selector xpath='.//div[@class="contents"]/node()' data='\n some text\n '>, <Selector xpath='.//div[@class="contents"]/node()' data='<ol>\n <li>\n more text\n'>, <Selector xpath='.//div[@class="contents"]/node()' data='\n'>]
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
... print(n.xpath('self::ol'))
...
[]
[<Selector xpath='self::ol' data='<ol>\n <li>\n more text\n'>]
[]
But you could hack into the underlying lxml object to test its type for a text node (it's "hidden" in the .root attribute of each scrapy Selector):
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
... print([type(n.root), n.root])
...
[<class 'str'>, '\n some text\n ']
[<class 'lxml.etree._Element'>, <Element ol at 0x7fa020f2f9c8>]
[<class 'str'>, '\n']
An alternative is to use an HTML-to-text conversion library like html2text:
>>> import html2text
>>> html2text.html2text("""<div class="contents">
... some text
... <ol>
... <li>
... more text
... </li>
... ...
... </ol>
... </div>""")
'some text\n\n 1. more text \n...\n\n'
Upvotes: 2
Reputation: 27996
If n is not an ol element, self::ol yields an empty node set. What is n.xpath(...) supposed to return when the result of the expression is an empty node set?
An empty node set is "falsy" in XPath, but you're not evaluating it as a boolean in XPath, only in Python. Is an empty node set falsy in Python?
If that's the problem, you could fix it by changing the if statement to
if n.xpath('boolean(self::ol)'):
or
if n.xpath('count(self::ol) > 0'):
Upvotes: 0