Matt Hough

Reputation: 1099

Selecting variations in Nokogiri

I'm scraping these two pages:

  1. https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Law
  2. https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=BSL

Unfortunately, their markup differs. One wraps the level name (e.g. Level 2) in a link (an <a href> tag), while the other has it as plain text. How can I select whichever one is present?

I tried this to no avail:

level.css(/"a[href]"|".left"/).text

Here are shortened versions of the 2 HTML sections:

<table class="chart"> 
    <tr valign="middle">
        <td class="left">Level 2</td> <!-- the problem -->
        <td class="middle"><div style="width:86%;"><strong>86%</strong></div></td>
    </tr>
</table>

<table class="chart">
    <tr valign="middle">
        <td class="left"><a href="availablepcsembed.php?branch=BSL&room=Lvl1">Level 1</a></td>
        <td class="middle"><div style="width:32%;"><strong>32%</strong></div></td>
    </tr>
</table>

My code (edited to show the whole method rather than a section of it):

def self.scrape_details_page(library_url)
    details_page = Nokogiri::HTML(open(library_url))

    details_page.css("table.chart tr").collect do |level|
        right = level.css(".right").text.split
        {level: level.css("a[href]").text, available: right[0], out_of_available: right[3]}
    end
end

Upvotes: 0

Views: 45

Answers (3)

Max

Reputation: 1957

The answer by jk_ should work in this particular case.

In the more general case, if you're going to use a CSS selector, you need to use CSS syntax for "or" (a comma). So if you were going to use the selectors you originally asked about, it'd be

level.css('a[href], .left').text

Upvotes: 1

Matt Hough

Reputation: 1099

Thanks to inspiration from @jk_, I fixed it using .css(".left").text. That selects all the text inside the left td of each tr, whether or not the label is wrapped in a link.

The working code:

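require 'nokogiri'
require 'open-uri'

def self.scrape_details_page(library_url)
    # URI.open comes from open-uri; plain open no longer accepts URLs on Ruby 3+.
    details_page = Nokogiri::HTML(URI.open(library_url))

    details_page.css("table.chart tr").collect do |level|
        right = level.css(".right").text.split
        # The .left cell's text is the level label, with or without a link around it.
        {level: level.css(".left").text, available: right[0], out_of_available: right[3]}
    end
end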

Upvotes: 0

jk_

Reputation: 755

If all you want is the text inside an element, you can call #text on the parsed element and it returns the text of everything nested inside it. There is no need to account for, or walk through, extra tags that might be present inside, e.g. the link tag. Given your code as written:

levels = details_page.css("table.chart tr").collect do |level|
    level.text
end

For each row, that returns the inner text (the level label together with the percentage value) as a single string, and collect gathers those strings into the levels array.

Edit: also, if all you care about is the level label, you can filter the cells by class up front:

details_page.css("table.chart tr td.left").collect do |level|
    level.text
end

Upvotes: 2
