xybee

Reputation: 23

XPath: select nodes between two nodes in Scrapy 0.24.5

<h3>Q1</h3>  
<p><p>text1</p></p><a name="1"> </a>  
<p>...</p>  
...  
<ul><li>...</li></ul>
<h3>Q2</h3>  
<p>text2</p><a name="2"> </a>  
<p>...</p>  
...  
<ul><li>...</li></ul>
<h3>Q3</h3>  
<p>text3</p>
<p>...</p>  
...  
<ul><li>...</li></ul>

Above is my HTML. I want to grab the text of each h3 and the text of the nodes that follow it, up to the next h3. In other words, if I put them in a dictionary, the result would look like:

{Q1:text1, Q2:text2, Q3:text3}    

I first tried selecting all the h3 tags and then looping over them. For each h3 tag, I tried to select all the nodes before the next h3 tag. Here is my code:

>>> h3_tags = response.xpath(".//h3")
>>> for h3_tag in h3_tags:
...     texts = h3_tag.xpath("./following-sibling::node()[count(preceding-sibling::h3)=1]/descendant-or-self::text()").extract()

But this only extracts the p text after the first h3 tag (and it also includes the text of the second h3 tag); I get nothing for the other h3 tags.

If I use:

>>> h3_tags = response.xpath(".//h3")
>>> for h3_tag in h3_tags:
...     texts = h3_tag.xpath("./following-sibling::node()[preceding-sibling::h3]/descendant-or-self::text()").extract()

I get redundant text from previous p elements for the second and third h3.

I'm using Scrapy 0.24.5, and this is my first day with it. Any help is appreciated!

Upvotes: 1

Views: 674

Answers (1)

paul trmbrth

Reputation: 20748

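Your first attempt hard-codes the count to 1, so it only matches the nodes that follow the first h3. You can still use the count(preceding-sibling::h3) technique if you pass in each h3's position, with some help from enumerate():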

>>> for cnt, h3 in enumerate(selector.xpath('.//h3'), start=1):
...     print h3.xpath('./following-sibling::node()[count(preceding-sibling::h3)=%d]' % cnt).extract()
... 
[u'  \n', u'<p></p>', u'<p>text1</p>', u'<a name="1"> </a>', u'  \n', u'<h3>Q2</h3>']
[u'  \n', u'<p>text2</p>', u'<a name="2"> </a>', u'  \n', u'<h3>Q3</h3>']
[u'  \n', u'<p>text3</p>']
>>> 
>>> for cnt, h3 in enumerate(selector.xpath('.//h3'), start=1):
...     print h3.xpath('./following-sibling::node()[count(preceding-sibling::h3)=%d]/descendant-or-self::text()' % cnt).extract()
... 
[u'  \n', u'text1', u' ', u'  \n', u'Q2']
[u'  \n', u'text2', u' ', u'  \n', u'Q3']
[u'  \n', u'text3']
>>> 

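Note that <p><p>text1</p></p> is not valid HTML (a p cannot contain another p), so lxml parses it as two sibling p elements, an empty <p></p> followed by <p>text1</p>, rather than a p nested in a p. That is where the empty <p></p> in the first output comes from.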
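
The outputs above still end with the next h3, and they contain whitespace-only strings. To get closer to the dictionary asked for in the question, one option is to add a not(self::h3) predicate and strip the text. Below is a minimal sketch that uses lxml directly (the library Scrapy selectors are built on) instead of a Scrapy response. The sample markup, the sections_by_heading name, and the choice to keep a list of text fragments per heading are assumptions made for illustration. It also assumes that all h3 elements are siblings under the same parent, because the count depends on that.

```python
from lxml import html

# Simplified stand-in for the page in the question (an assumption, not the real page).
SAMPLE = """
<div>
<h3>Q1</h3>
<p>text1</p><a name="1"> </a>
<ul><li>item1</li></ul>
<h3>Q2</h3>
<p>text2</p><a name="2"> </a>
<h3>Q3</h3>
<p>text3</p>
</div>
"""


def sections_by_heading(root):
    """Map each h3's text to the non-empty texts of the elements up to the next h3."""
    result = {}
    for cnt, h3 in enumerate(root.xpath('.//h3'), start=1):
        # Keep following sibling elements that are not h3 themselves and that
        # have exactly `cnt` h3 siblings before them, i.e. this section only.
        nodes = h3.xpath(
            './following-sibling::*[not(self::h3)]'
            '[count(preceding-sibling::h3)=%d]' % cnt
        )
        texts = [node.text_content().strip() for node in nodes]
        # Drop elements whose text is empty or whitespace-only, such as the anchors.
        result[h3.text_content().strip()] = [t for t in texts if t]
    return result


result = sections_by_heading(html.fromstring(SAMPLE))
print(result)
```

In a spider, the same XPath expression should work with h3.xpath(...) on the Scrapy selectors, using .extract() and your own whitespace filtering in place of text_content().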

Upvotes: 2
