analina

Reputation: 105

Processing html text nodes with scrapy and XPath

I'm using scrapy to process documents like this one:

...
<div class="contents">
    some text
    <ol>
        <li>
            more text
        </li>
        ...
    </ol>
</div>
...

I want to collect all the text inside the contents area into a string. I also need the '1., 2., 3....' from the <li> elements, so my result should be 'some text 1. more text...'

So, I'm looping over <div class="contents">'s children

for n in response.xpath('//div[@class="contents"]/node()'):
    if n.xpath('self::ol'):
        result += process_list(n)
    else:
        result += n.extract()

If n is an ordered list, I loop over its elements and add a number to li/text() (in process_list()). If n is a text node itself, I just read its value. However, 'some text' doesn't seem to be part of the node set, since the loop never reaches the else branch. My result is '1. more text'.

Finding text nodes relative to their parent node works:

response.xpath('//div[@class="contents"]//text()')

finds all the text, but this way I can't add the list item numbers.

What am I doing wrong and is there a better way to achieve my task?

Upvotes: 1

Views: 713

Answers (2)

paul trmbrth

Reputation: 20748

Scrapy's Selectors use lxml under the hood, but lxml doesn't work with XPath calls on text nodes.

>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="contents">
...     some text
...     <ol>
...         <li>
...             more text
...         </li>
...         ...
...     </ol>
... </div>''')
>>> s.xpath('.//div[@class="contents"]/node()')
[<Selector xpath='.//div[@class="contents"]/node()' data='\n    some text\n    '>, <Selector xpath='.//div[@class="contents"]/node()' data='<ol>\n        <li>\n            more text\n'>, <Selector xpath='.//div[@class="contents"]/node()' data='\n'>]
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
...     print(n.xpath('self::ol'))
... 
[]
[<Selector xpath='self::ol' data='<ol>\n        <li>\n            more text\n'>]
[]

But you could hack into the underlying lxml object to test its type for a text node (it's "hidden" in the .root attribute of each scrapy Selector):

>>> for n in s.xpath('.//div[@class="contents"]/node()'):
...     print([type(n.root), n.root])
... 
[<class 'str'>, '\n    some text\n    ']
[<class 'lxml.etree._Element'>, <Element ol at 0x7fa020f2f9c8>]
[<class 'str'>, '\n']
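Putting that .root type check to work, here is a minimal sketch of the full loop. It uses lxml directly rather than a scrapy Selector (scrapy uses lxml under the hood, so the node types are the same), and process_list is a hypothetical numbering helper, since the question does not show its implementation:

```python
from lxml import html

doc = html.fromstring('''<div class="contents">
    some text
    <ol>
        <li>
            more text
        </li>
    </ol>
</div>''')

def process_list(ol):
    # Hypothetical helper: prefix each <li>'s text with "1. ", "2. ", ...
    parts = []
    for i, li in enumerate(ol.xpath('./li'), start=1):
        parts.append(f'{i}. {li.text_content().strip()}')
    return ' '.join(parts)

result = []
for node in doc.xpath('//div[@class="contents"]/node()'):
    if isinstance(node, str):
        # lxml returns text nodes as (smart) strings
        text = node.strip()
        if text:
            result.append(text)
    elif node.tag == 'ol':
        result.append(process_list(node))

print(' '.join(result))  # -> some text 1. more text
```

With scrapy Selectors the same check is `isinstance(n.root, str)` instead of testing the node itself.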

An alternative is to use an HTML-to-text conversion library like html2text:

>>> import html2text
>>> html2text.html2text("""<div class="contents">
...     some text
...     <ol>
...         <li>
...             more text
...         </li>
...         ...
...     </ol>
... </div>""")
'some text\n\n  1. more text \n...\n\n'

Upvotes: 2

LarsH

Reputation: 27996

If n is not an ol element, self::ol yields an empty node set. What is n.xpath(...) supposed to return when the result of the expression is an empty node set?

An empty node set is "falsy" in XPath, but you're not evaluating it as a boolean in XPath, only in Python. Is an empty node set falsy in Python?

If that's the problem, you could fix it by changing the if statement to

if n.xpath('boolean(self::ol)'):

or

if n.xpath('count(self::ol) > 0'):
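For what it's worth, when these expressions are evaluated with plain lxml (which scrapy uses underneath), boolean() and count() come back as Python primitives, so the truth test is unambiguous. A small sketch (scrapy's own Selector wraps XPath results in a SelectorList, so the exact truthiness there depends on that wrapping):

```python
from lxml import html

# A bare <ol> fragment; lxml returns the element itself.
ol = html.fromstring('<ol><li>more text</li></ol>')

# boolean() yields a Python bool, count() a Python float.
print(ol.xpath('boolean(self::ol)'))   # True
print(ol.xpath('count(self::ol)'))     # 1.0
print(ol.xpath('boolean(self::div)'))  # False
```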

Upvotes: 0
