SoSimple

Reputation: 701

Selecting a specific table cell using CSS

I scraped the rankings table from atpworldtour.com and I'm trying to access the player names.

An example of a row in the table looks like this:

<tr>
  <td class="rank-cell">1</td>
  <td class="move-cell">
    <div class="move-none"></div>
    <div class="move-text">
    </div>
  </td>
  <td class="country-cell">
    <div class="country-inner">
      <div class="country-item">
        <img src="/~/media/images/flags/srb.png" alt="SRB" onerror="this.remove()">
      </div>
    </div>
  </td>
  <td class="player-cell">
    <a href="/en/players/novak-djokovic/d643/overview" data-ga-label="Novak Djokovic">Novak Djokovic</a>
  </td>
  <td class="age-cell">28</td>
  <td class="points-cell">
    <a href="/en/players/novak-djokovic/d643/rankings-breakdown?team=singles" data-ga-label="rankings-breakdown">15,785</a>
  </td>
  <td class="tourn-cell">
    <a href="/en/players/novak-djokovic/d643/player-activity?matchType=singles" data-ga-label="player-activity">17</a>
  </td>
  <td class="pts-cell">1,500</td>
  <td class="next-cell">0</td>
</tr>

I tried a few different ways of pulling this information. The most success I've had so far is with this:

url = "http://www.atpworldtour.com/en/rankings/singles"
doc = Nokogiri::HTML(open(url))

doc.css("tr").each do |row|
  puts row.css("td a")
end

The problem is that there are two other links in each row after the player's name, so I get them all lumped together. The player's name is in the fourth cell of each row, so I tried to pull the fourth cell first and then access the link:

doc.css("tr").each do |row|
  cell = row.css("td")[3]
  puts cell.css("a").text
end

but that returns the error undefined method 'css' for nil:NilClass.

Upon further investigation, cell seemed to be storing ALL the cells with the player names instead of just the one for the current iteration of row, but when I then tried to iterate through cell I got the same undefined method error.

I also tried to solve this problem using XPath:

doc.xpath("//tr").each do |row|
  puts row.xpath("/td[3]/a").text
end

but the output is a big area of blank space where the names should be listed.

  1. Are there any tips about what I'm doing wrong?
  2. If anyone can point me toward detailed documentation for using CSS/XPath selectors with Nokogiri, I'd be grateful.

Everything I've found so far only covers the very basics and I'm having trouble finding information on how to perform more complex operations.


I actually got it working using:

doc.xpath("//tr").each do |row|
  puts row.at_css("a").text
end

but any help finding proper documentation/tutorials for using XPath and CSS selectors with Nokogiri would still be great.

Upvotes: 1

Views: 415

Answers (2)

the Tin Man

Reputation: 160551

Perhaps this will help shed some light on what's happening:

require 'nokogiri'
doc = Nokogiri::HTML('<table><tr><td>foo</td><td>bar</td></tr></table>')

at returns the first matching Node. In this case it's the <tr>. Using text returns all the text inside it concatenated together:

doc.at('tr').to_html # => "<tr>\n<td>foo</td>\n<td>bar</td>\n</tr>"
doc.at('tr').text # => "foobar"

Using search returns a NodeSet, which is most easily thought of as an Array. In this case it'll return two elements, one for each <tr><td> pair:

doc.search('tr td').size # => 2

text will return the text for all Nodes in the NodeSet, concatenating the strings again:

doc.search('tr td').to_html # => "<td>foo</td>\n<td>bar</td>"
doc.search('tr td').text # => "foobar"

However, by iterating over each Node in the NodeSet we can see the individual text:

doc.search('tr td').map(&:text) # => ["foo", "bar"]

An alternate, but slightly slower, way to do the same thing is to find the <tr> Node first, then search inside it for the individual <td> nodes:

doc.at('tr').search('td').size # => 2
doc.at('tr').search('td').to_html # => "<td>foo</td>\n<td>bar</td>"
doc.at('tr').search('td').text # => "foobar"

Again, using map we can iterate over them and get the text without concatenation:

doc.at('tr').search('td').map(&:text) # => ["foo", "bar"]

Here's the difference in speed using a single vs. separate selectors to descend and select the <td> nodes:

require 'fruity'
require 'nokogiri'

doc = Nokogiri::HTML('<table><tr><td>foo</td><td>bar</td></tr></table>')

compare do
  single_selector { doc.search('tr td').map(&:text) }
  separate_selectors { doc.at('tr').search('td').map(&:text) }
end
# >> Running each test 32 times. Test will take about 1 second.
# >> single_selector is faster than separate_selectors by 2x ± 0.1

The difference is due to a single round-trip call to libxml2 for doc.search('tr td') vs. two for doc.at('tr').search('td').

Unfortunately, sometimes we're forced into using the longer, slower form, if we need to use conditional logic or access multiple disparate types of child nodes in the order they're presented in the markup.

Upvotes: 1

Stefan

Reputation: 114188

The table cell containing the player's name has a class player-cell:

<td class="player-cell">
  <a href="/en/players/novak-djokovic/d643/overview" data-ga-label="Novak Djokovic">Novak Djokovic</a>
</td>

You can use this class to fetch the elements:

doc.css('.player-cell a').map(&:text)
#=> ["Novak Djokovic", "Roger Federer", "Andy Murray", ...]

Even without an explicit class, you could fetch the 4th column via:

doc.css('td:nth-child(4) a').map(&:text)
#=> ["Novak Djokovic", "Roger Federer", "Andy Murray", ...]

Or using XPath:

doc.xpath('//td[4]/a').map(&:text)
#=> ["Novak Djokovic", "Roger Federer", "Andy Murray", ...]

Upvotes: 0
