sharataka

Reputation: 5132

How to parse HTML with Nokogiri in Ruby

I am trying to parse some HTML with Nokogiri and am running into problems. I want to go through each element with the class "employerReview" and capture the content under "pros" and "cons".

I can't even get the first part working: printing a single item to the console.

require 'open-uri'
require 'nokogiri'


doc = Nokogiri::HTML(open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))

doc.css('//*[@id="empReview_2320868"]/div[1]/p[1]/tt').each do |link|
  puts link.content
end

Upvotes: 2

Views: 272

Answers (3)

Arie Xiao

Reputation: 14082

You've passed an XPath expression to the css method, which expects a CSS selector. Use the xpath method instead:

require 'open-uri'
require 'nokogiri'

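# On Ruby 3.0+, open-uri no longer patches Kernel#open; use URI.open.
doc = Nokogiri::HTML(URI.open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))
# Select the first two paragraphs (pros, then cons) of each review description.
ps = doc.xpath('//div[@class="employerReview"]//div[@class="description"]/p[position()<3]')

# Pair the stripped paragraph texts up as [pros, cons].
ps.map { |p| p.text.strip }.each_slice(2) do |pros, cons|
  puts pros
  puts cons
end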

The XPath above also matches the "Pros -" and "Cons -" labels. If that's not what you want, change the XPath to:

//div[@class="employerReview"]//div[@class="description"]/p[position()<3]/tt

Upvotes: 0

summea

Reputation: 7583

Here is one way to get closer to the data you are looking for, using CSS selectors instead of XPath:

require 'open-uri'
require 'nokogiri'

# On Ruby 3.0+, open-uri no longer patches Kernel#open; use URI.open.
doc = Nokogiri::HTML(URI.open('http://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651.htm'))

doc.css('div.employerReview > div.description > p > strong').each do |item|
  puts item.content
  item.parent.css('tt').each do |details|
    puts details.content
  end
end

Upvotes: 0

the Tin Man

Reputation: 160551

One problem is that you're passing an XPath expression to a method that expects a CSS selector:

doc.css('//*[@id="empReview_2320868"]/div[1]/p[1]/tt')

You can use search or xpath for XPaths instead.

That doesn't find the nodes you want though. A simple test shows they don't exist:

doc.css("#empReview_2320868")

should return something, but it returns [], meaning no tag in the document has that ID.

Upvotes: 1
