SkyFox

Reputation: 1875

How to select the next node using Scrapy

I have HTML that looks like this:

<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>

I know how to use Scrapy to extract the text of the h1:

content.select("//h1[contains(text(),'Text 1')]/text()").extract()

But my goal is to extract the content of <div>Some info</div>.

My problem is that I don't have any specific information about the div. All I know is that it comes immediately after <h1>Text 1</h1>. Can I use selectors to get the NEXT element in the tree, that is, the element at the same level of the DOM tree?

Something like:

a = content.select("//h1[contains(text(),'Text 1')]/text()")
a.next("//div/text()").extract()
Some info

Upvotes: 15

Views: 8902

Answers (2)

Ivan Ogai

Reputation: 1486

Use the following-sibling axis. From the XPath specification (https://www.w3.org/TR/2017/REC-xpath-31-20170321/):

the following-sibling axis contains the context node's following siblings, those children of the context node's parent that occur after the context node in document order;

Example:

from scrapy.selector import Selector
text = '''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
'''
sel = Selector(text=text)
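# Select the text of every h1, then look up the first div sibling after each one.
h1s = sel.xpath('//h1/text()')
for counter, h1 in enumerate(h1s, 1):
    # XPath positions are 1-based, so enumerate starts at 1.
    div = sel.xpath('(//h1)[{}]/following-sibling::div[1]/text()'.format(counter))
    print(h1.get())
    print(div.get())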

The output is:

Text 1
Some info
Text 2
...

Upvotes: 0

kev

Reputation: 161674

Try this XPath:

//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()

Upvotes: 20
