Reputation: 211
I have this HTML:
<div id="General" class="detailOn">
<div class="tabconstraint"></div>
<div id="InstitutionMain" class="detailseparate">
<div id="InstitutionMain_divINFORight" style="float:right;width:40%"></div>
<div style="font-weight:bold;padding-top:6px">Special Learning Opportunities</div>
Distance learning opportunities<br>
<div style="font-weight:bold;padding-top:6px">Student Services</div>
Remedial services<br>
Academic/career counseling service<br>
<div style="font-weight:bold;padding-top:6px">Credit Accepted</div>
Dual credit<br>
Credit for life experiences<br>
</div>
</div>
I want to extract the text that sits between the div whose text is "Special Learning Opportunities" and the div whose text is "Student Services" (here, "Distance learning opportunities"),
and similarly for the other divs.
I tried this XPath, which gives me all the text following the identified div:
div[1]/div[contains(text(),"Special Learning Opportunities")]/following-sibling::text()
while this one gives me all the text before the div:
div[1]/div[contains(text(),"Student Services")]/preceding-sibling::text()
Is there a way to get exactly the text between two specified divs? Thanks in advance.
I am using Python 2.x and Scrapy for crawling.
Note: my current method uses these three XPaths:
item['SLO']=site.select('div[1]/div[contains(text(),"Special Learning Opportunities")]/following-sibling::text()').extract()
item['SS']=site.select('div[1]/div[contains(text(),"Student Services")]/following-sibling::text()').extract()
item['CA']=site.select('div[1]/div[contains(text(),"Credit Accepted")]/following-sibling::text()').extract()
I get three items like this:
item['SLO']=['Distance learning opportunities','Remedial services',' Academic/career counseling service','Dual credit','Credit for life experiences']
item['SS']=['Remedial services',' Academic/career counseling service','Dual credit','Credit for life experiences']
item['CA']=['Dual credit','Credit for life experiences']
and then I post-process the Python lists to get what I want.
But I think there should be a quicker way to do this in XPath.
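For reference, the list post-processing described above could look roughly like this. It is only a sketch: `trim_sections` is a hypothetical helper name, and it assumes the lists look exactly like the ones shown, i.e. each later list is a suffix of the earlier one.

```python
def trim_sections(sections):
    """Cut each list at the point where the next section's list begins.

    `sections` holds one list per heading, in document order. Each list
    contains all text following that heading, so every later list is
    assumed to be a suffix of the earlier ones.
    """
    trimmed = []
    for current, following in zip(sections, sections[1:] + [[]]):
        # Everything past `keep` belongs to the next heading.
        keep = len(current) - len(following)
        trimmed.append([text.strip() for text in current[:keep]])
    return trimmed


slo = ['Distance learning opportunities', 'Remedial services',
       ' Academic/career counseling service', 'Dual credit',
       'Credit for life experiences']
ss = ['Remedial services', ' Academic/career counseling service',
      'Dual credit', 'Credit for life experiences']
ca = ['Dual credit', 'Credit for life experiences']

slo_only, ss_only, ca_only = trim_sections([slo, ss, ca])
print(slo_only)
print(ss_only)
print(ca_only)
```

This works, but it needs one query per heading plus the trimming step, which is why a single XPath per section would be nicer.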
Upvotes: 3
Views: 5817
Reputation: 116
You may try this:
//div[contains(text(),"Special Learning Opportunities")]/following-sibling::text()[following-sibling::div[contains(text(),'Student Services')]]
Upvotes: 1
Reputation: 16907
You can translate "text between a and b" directly into XPath as "text()[preceding-sibling = a and following-sibling = b]",
i.e.:
//text()[(preceding-sibling::div[1]/text() = "Special Learning Opportunities") and (following-sibling::div[1]/text() = "Student Services")]
should work.
(It failed when I tested it, but that seems to be a bug in my XPath interpreter.)
Upvotes: 4
Reputation: 3416
Here you go. Not as classy as the previous answer, but hey, at least it works! :-)
div[1]//div[contains(text(),"Special Learning Opportunities")]/following-sibling::node()[position() <= count( div[1]//div[contains(text(),"Student Services")]/following-sibling::node()) + 1]
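If the XPath gets too awkward, the same grouping can also be sketched in plain Python by walking the children of the container in document order. This is not XPath and not Scrapy's API, just the standard library's `xml.etree.ElementTree`. It assumes the snippet becomes well-formed once `<br>` is rewritten as `<br/>` (real pages usually need a proper HTML parser), and `group_by_heading` is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<div id="InstitutionMain" class="detailseparate">
<div id="InstitutionMain_divINFORight" style="float:right;width:40%"></div>
<div style="font-weight:bold;padding-top:6px">Special Learning Opportunities</div>
Distance learning opportunities<br>
<div style="font-weight:bold;padding-top:6px">Student Services</div>
Remedial services<br>
Academic/career counseling service<br>
<div style="font-weight:bold;padding-top:6px">Credit Accepted</div>
Dual credit<br>
Credit for life experiences<br>
</div>"""


def group_by_heading(markup):
    # Assumption: closing the <br> tags is enough to make this snippet
    # parse as XML; that will not hold for arbitrary HTML.
    root = ET.fromstring(markup.replace("<br>", "<br/>"))
    groups = {}
    heading = None
    for child in root:
        if child.tag == "div" and child.text and child.text.strip():
            # A non-empty <div> starts a new section.
            heading = child.text.strip()
            groups[heading] = []
        # Text that follows an element is stored in that element's .tail.
        tail = (child.tail or "").strip()
        if heading is not None and tail:
            groups[heading].append(tail)
    return groups


sections = group_by_heading(HTML)
print(sections["Special Learning Opportunities"])
print(sections["Student Services"])
print(sections["Credit Accepted"])
```

Each heading ends up with only the text that sits between it and the next heading div, so no list trimming is needed afterwards.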
Upvotes: 2