Kyle

Reputation: 1173

Problems scraping Google results with Nokogiri and XPATH

I'm having issues trying to scrape search results from Google using Nokogiri and XPath. The problem only occurs with Google; other sites seem to work fine.

I'm getting an element's XPath string using Chrome's element inspector.

This is a working Stack Overflow example:

# Testing element on StackOverflow - returns the questions text
doc = Nokogiri::HTML(open('http://stackoverflow.com/questions/17763549/how-do-i-scrape-data-through-mechanize-and-nokogiri'))

p doc.at_xpath("//*[@id='question-header']/h1/a").text
=> "How do I scrape data through Mechanize and Nokogiri?" 

Trying to use Google results in:

# Testing element on Google, should return the first result title
doc = Nokogiri::HTML(open('https://www.google.com/#q=stack+overflow+error'))

p doc.at_xpath("//*[@id='rso']/li[1]/div/h3/a").text
NoMethodError: undefined method `text' for nil:NilClass
  from (irb):81
  from /home/kyle/.rvm/gems/ruby-2.1.0/gems/railties-3.2.13/lib/rails/commands/console.rb:47:in `start'
  from /home/kyle/.rvm/gems/ruby-2.1.0/gems/railties-3.2.13/lib/rails/commands/console.rb:8:in `start'
  from /home/kyle/.rvm/gems/ruby-2.1.0/gems/railties-3.2.13/lib/rails/commands.rb:41:in `<top (required)>'
  from script/rails:6:in `require'
  from script/rails:6:in `<main>'

I'm getting a "NoMethodError" on all Google pages. Any idea what's going on here?

Upvotes: 1

Views: 1013

Answers (2)

Dmitriy Zub

Reputation: 1724

One possible cause of the error is that you don't specify a User-Agent header, so Google blocks your request.

For example, in Python's requests library the default user-agent is python-requests, which needs to be changed; otherwise Google might block the request because it identifies the client as a bot.

Also, as Zach Kemp pointed out, the URL is slightly wrong: #q= should be ?q=
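To see why the #q= form matters, here is a small sketch that uses only Ruby's standard uri library. Everything after # is a URL fragment, which stays on the client side and is not sent to the server, so a non-browser client given the question's URL ends up requesting Google's home page rather than a results page. The variable names are illustrative.

```ruby
require 'uri'

# URL from the question: the search terms sit in the fragment (after '#').
fragment_url = URI.parse('https://www.google.com/#q=stack+overflow+error')

# Corrected form: the search terms are a real query string on /search.
query_url = URI.parse('https://www.google.com/search?q=stack+overflow+error')

puts fragment_url.fragment       # q=stack+overflow+error
puts fragment_url.query.inspect  # nil
puts fragment_url.request_uri    # /
puts query_url.request_uri       # /search?q=stack+overflow+error
```

request_uri is the path and query that an HTTP client puts on the request line, which is why the fragment version never reaches the search endpoint.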

require 'nokogiri'
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "stackoverflow error",
  hl: "en"
}

response = HTTParty.get('https://www.google.com/search',
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

puts doc.at_xpath('//*[contains(@class, "yuRUbf")]/a/h3/text()')

# or at_css, which is more concise for class names (Nokogiri translates the selector to XPath internally)
puts doc.at_css(".yuRUbf > a > h3")&.text

---
#=> What is a StackOverflowError? - Stack Overflow

Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you only need to iterate over structured JSON rather than work out the selectors and parsing yourself. Check out the playground.

require 'google_search_results' 

params = {
  api_key: "YOUR_API_KEY",
  engine: "google",
  q: "stackoverflow error",
  hl: "en"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

# [0] first element from organic results
puts hash_results[:organic_results][0][:title]

---
#=> StackOverflowError (Java Platform SE 7 ) - Oracle Help Center

Disclaimer: I work for SerpApi.

Upvotes: 0

Pafjo

Reputation: 5019

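Google does not return the data you're looking for in the response. This element is fetched with JavaScript when the page is loaded by the browser. Nokogiri does not run any JavaScript on a page, so at_xpath finds no matching node and returns nil, which is why calling .text raises the NoMethodError.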

Upvotes: 1
