Selecting a specific table cell using CSS

Question

I scraped the rankings table from atpworldtour.com and I'm trying to access the player names.

An example of a row in the table looks like this:


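<tr>
  <td>1</td>
  <td></td>
  <td></td>
  <td>
    <a>Novak Djokovic</a>
  </td>
  <td>28</td>
  <td>
    <a>15,785</a>
  </td>
  <td>
    <a>17</a>
  </td>
  <td>1,500</td>
  <td>0</td>
</tr>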

I tried a few different ways of pulling this information. The most success I've had so far is with this:

require 'nokogiri'
require 'open-uri'

url = "http://www.atpworldtour.com/en/rankings/singles"
doc = Nokogiri::HTML(URI.open(url))

doc.css("tr").each do |row|
  puts row.css("td a")
end

The problem is that there are two other links in each row after the player's name, so I get them all lumped together. The player's name is in the fourth cell of each row, so I tried to pull the fourth cell first and then access the link:

doc.css("tr").each do |row|
  cell = row.css("td")[3]
  puts cell.css("a").text
end

but that returns the error undefined method 'css' for nil:NilClass.

Upon further investigation, cell seemed to be storing ALL the cells with the player names instead of just the one for the current iteration of row, but when I then tried to iterate through cell I got the same undefined method error.

I also tried to solve this problem using XPath:

doc.xpath("//tr").each do |row|
  puts row.xpath("/td[3]/a").text
end

but the output is a big area of blank space where the names should be listed.

Are there any tips about what I'm doing wrong?
If anyone can point me toward detailed documentation for using CSS/XPath selectors with Nokogiri, I'd be grateful.

Everything I've found so far only covers the very basics and I'm having trouble finding information on how to perform more complex operations.

I actually got it working using:

doc.xpath("//tr").each do |row|
  puts row.at_css("a").text
end

but any help finding proper documentation/tutorials for using XPath and CSS selectors with Nokogiri would still be great.

the Tin Man · Accepted Answer

Perhaps this will help shed some light on what's happening:

require 'nokogiri'
doc = Nokogiri::HTML('<table><tr><td>foo</td><td>bar</td></tr></table>')

at returns the first matching Node. In this case it's the <tr>. Using text returns all the text inside it, concatenated together:

doc.at('tr').to_html # => "<tr>\n<td>foo</td>\n<td>bar</td>\n</tr>"
doc.at('tr').text # => "foobar"

Using search returns a NodeSet, which is most easily thought of as an Array. In this case it'll return two elements, one for each <td>...</td> pair:

doc.search('tr td').size # => 2

text will return the text for all Nodes in the NodeSet, concatenating the strings again:

doc.search('tr td').to_html # => "<td>foo</td>\n<td>bar</td>"
doc.search('tr td').text # => "foobar"

However, by iterating over each Node in the NodeSet we can see the individual text:

doc.search('tr td').map(&:text) # => ["foo", "bar"]

An alternate, but slightly slower, way to do the same thing is to find the <tr> Node first, then search inside it for the individual nodes:

doc.at('tr').search('td').size # => 2
doc.at('tr').search('td').to_html # => "<td>foo</td>\n<td>bar</td>"
doc.at('tr').search('td').text # => "foobar"

Again, using map we can iterate over them and get the text without concatenation:

doc.at('tr').search('td').map(&:text) # => ["foo", "bar"]

Here's the difference in speed using a single vs. separate selectors to descend and select the nodes:

require 'fruity'
require 'nokogiri'

doc = Nokogiri::HTML('<table><tr><td>foo</td><td>bar</td></tr></table>')

compare do
  single_selector { doc.search('tr td').map(&:text) }
  separate_selectors { doc.at('tr').search('td').map(&:text) }
end
# >> Running each test 32 times. Test will take about 1 second.
# >> single_selector is faster than separate_selectors by 2x ± 0.1

The difference is due to a single round-trip call to libXML2 for tr td vs. two for doc.at('tr').search('td').

Unfortunately, sometimes we're forced into using the longer, slower form, if we need to use conditional logic or access multiple disparate types of child nodes in the order they're presented in the markup.
