Reputation: 211
I want to extract an element if the previous element's text() matches specific criteria. For example:
<html>
<div>
<table class="layouttab">
<tbody>
<tr>
<td scope="row" class="srb">General information: </td>
<td>(xxx) yyy-zzzz</td>
</tr>
<tr>
<td scope="row" class="srb">Website: </td>
<td><a href="http://xyz.edu" target="_blank">http://www.xyz.edu</a>
</td>
</tr>
<tr>
<td scope="row" class="srb">Type: </td>
<td>4-year, Private for-profit</td>
</tr>
<tr>
<td scope="row" class="srb">Awards offered: </td>
<td>Less than one year certificate<br>One but less than two years certificate<br>Associate's degree<br>Bachelor's
degree
</td>
</tr>
<tr>
<td scope="row" class="srb">Campus setting: </td>
<td>City: Small</td>
</tr>
<tr>
<td scope="row" class="srb">Related Institutions:</td>
<td><a href="?q=xyz">xyz-New York</a>
(Parent):
<ul>
<li style="list-style:circle">Berkeley College - Westchester Campus</li>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
</html>
Now, I want to extract the URL if the previous element's text() contains "Website: ". I am using Python 2.x with Scrapy 0.14. I was able to extract data by addressing individual elements by position, for example:
item['Header_Type']= site.select('div/table[@class="layouttab"]/tr[3]/td[2]/text()').extract()
But this approach fails if the Website row is missing: tr[3] shifts upward, so I get 'Type' in the website field and 'Awards offered' in the Type field.
Is there a specific XPath expression for this, something like:
div/table[@class="layouttab"]/tr/td[2] {if td[1] has text = "Website"}
Thanks in advance.
Upvotes: 3
Views: 8617
Reputation: 76
With Python and Scrapy, you can use the following to select the "Type" field; it worked well for me:
item['Header_Type']= site.select('div[1]/table[@class="layouttab"]/tr/td[contains(text(),"Type")]/following-sibling::td[1]/text()').extract()
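To illustrate why matching on the label is safer than a fixed row index, here is a minimal sketch of the same idea in plain Python. It uses the standard library's xml.etree.ElementTree rather than Scrapy, so it is a swapped-in technique and not the selector call above. The helper name cells_by_label and the trimmed SAMPLE markup are assumptions made for the example, and the Website row is left out on purpose.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question (an assumption:
# real pages usually need an HTML-tolerant parser such as Scrapy's selectors).
SAMPLE = """
<div>
  <table class="layouttab">
    <tbody>
      <tr><td scope="row" class="srb">General information: </td><td>(xxx) yyy-zzzz</td></tr>
      <tr><td scope="row" class="srb">Type: </td><td>4-year, Private for-profit</td></tr>
      <tr><td scope="row" class="srb">Campus setting: </td><td>City: Small</td></tr>
    </tbody>
  </table>
</div>
"""


def cells_by_label(root):
    """Map each row's label (first td, trailing colon removed) to its second td."""
    cells = {}
    for row in root.iter("tr"):
        tds = row.findall("td")
        if len(tds) < 2:
            continue  # skip rows that lack a label/value pair
        label = (tds[0].text or "").strip().rstrip(":")
        cells[label] = tds[1]
    return cells


root = ET.fromstring(SAMPLE)
cells = cells_by_label(root)

# The Website row is missing here, so the lookup yields None
# instead of silently returning the Type row.
website = cells.get("Website")
header_type = cells["Type"].text
print(website, header_type)
```

Because each value is looked up by its label, a missing row only affects its own field; the other fields do not shift.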
Upvotes: 5
Reputation: 116
This will also work, and is more generic:
//table[@class='layouttab']//td[contains(text(),'Website')]/following-sibling::td//text()
If there is only one table on the page you are extracting data from, then this will also work:
//td[contains(text(),'Website')]/following-sibling::td//text()
Upvotes: 1
Reputation: 6271
This works for me:
/html/div/table[@class="layouttab"]/tbody/tr/td[. = 'Website: ']/following-sibling::td/a/text()
It selects the td and checks whether its text matches "Website: ", uses following-sibling to go to the next td, then selects the a and gets the URL using text().
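The steps can be traced one at a time in plain Python. This is only a sketch using the standard library's xml.etree.ElementTree, which has no following-sibling axis, so the sibling step is done by index instead. The ROW snippet is copied from the question; the variable names are assumptions for the example. Note that in the sample markup the link text and the href differ, so a/@href is the one that returns the actual target.

```python
import xml.etree.ElementTree as ET

# The Website row from the question, closed up so it parses as XML.
ROW = """
<tr>
  <td scope="row" class="srb">Website: </td>
  <td><a href="http://xyz.edu" target="_blank">http://www.xyz.edu</a></td>
</tr>
"""

row = ET.fromstring(ROW)
tds = list(row)  # direct children of the row: the two td cells

link_text = None
link_href = None
for i, td in enumerate(tds[:-1]):
    if td.text == "Website: ":         # td[. = 'Website: ']
        next_td = tds[i + 1]           # following-sibling::td[1]
        a = next_td.find("a")          # /a
        if a is not None:
            link_text = a.text         # /text()
            link_href = a.get("href")  # /@href
        break

print(link_text)
print(link_href)
```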
Upvotes: 1
Reputation: 18553
The following XPath will do:
/html/div/table[@class='layouttab']/tbody/tr/td[contains(text(),'Website')]/following-sibling::td[1]
Upvotes: 1
Reputation: 26160
div/table[@class="layouttab"]/tr/td[contains(text(),"Website")]/following-sibling::td[1]
will work, I think. Otherwise, you could use the parent axis and go to td[2] from there.
Upvotes: 4