Reputation: 1099
I'm scraping these two sites:
Unfortunately, the two pages differ. One wraps the level name (e.g. Level 2) in an <a href> link, while the other has it as plain text. How can I select whichever one is present?
I tried this to no avail:
level.css(/"a[href]"|".left"/).text
Here are shortened versions of the 2 HTML sections:
<table class="chart">
<tr valign="middle">
<td class="left">Level 2</td> <!-- the problem -->
<td class="middle"><div style="width:86%;"><strong>86%</strong></div></td>
</tr>
</table>
<table class="chart">
<tr valign="middle">
<td class="left"><a href="availablepcsembed.php?branch=BSL&room=Lvl1">Level 1</a></td>
<td class="middle"><div style="width:32%;"><strong>32%</strong></div></td>
</tr>
</table>
My Code (edited from section of code to whole method)
def self.scrape_details_page(library_url)
  details_page = Nokogiri::HTML(open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    {level: level.css("a[href]").text, available: right[0], out_of_available: right[3]}
  end
end
Upvotes: 0
Views: 45
Reputation: 1957
The answer by jk_ should work in this particular case.
In the more general case, a CSS selector expresses "or" with a comma. So the selectors you originally asked about would be written as:
level.css('a[href], .left').text
Upvotes: 1
Reputation: 1099
Thanks to inspiration from @jk_, I fixed it using .css(".left").text. That just selects all the text in the left td inside the tr.
The working code:
def self.scrape_details_page(library_url)
  details_page = Nokogiri::HTML(open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    {level: level.css(".left").text, available: right[0], out_of_available: right[3]}
  end
end
Upvotes: 0
Reputation: 755
If what you want is the text inside the innermost element, you can get it just by calling #text on the parsed td element. #text returns the text of all descendants, so there is no need to account for and walk any extra tags that might be present inside, e.g. the link tag. Given your code as written:
levels = details_page.css("table.chart tr").collect do |level|
  level.text
end
For each row, that pulls the inner text (the level label together with the percentage value) as a single string, and collect gathers those strings into the levels array.
Edit: also, if all you care about is getting the level label, you can just filter the elements by class up front:
details_page.css("table.chart tr td.left").collect do |level|
  level.text
end
Upvotes: 2