Reputation: 1173
I'm having trouble scraping search results from Google using Nokogiri and XPath. Only Google is a problem; other sites seem to work fine.
I get each element's XPath string from Chrome's element inspector.
This is a working Stack Overflow example:
# Testing element on StackOverflow - returns the questions text
doc = Nokogiri::HTML(open('http://stackoverflow.com/questions/17763549/how-do-i-scrape-data-through-mechanize-and-nokogiri'))
p doc.at_xpath("//*[@id='question-header']/h1/a").text
=> "How do I scrape data through Mechanize and Nokogiri?"
Trying to use Google results in:
# Testing element on Google, should return the first result title
doc = Nokogiri::HTML(open('https://www.google.com/#q=stack+overflow+error'))
p doc.at_xpath("//*[@id='rso']/li[1]/div/h3/a").text
NoMethodError: undefined method `text' for nil:NilClass
from (irb):81
from /home/kyle/.rvm/gems/ruby-2.1.0/gems/railties-3.2.13/lib/rails/commands/console.rb:47:in `start'
from /home/kyle/.rvm/gems/ruby-2.1.0/gems/railties-3.2.13/lib/rails/commands/console.rb:8:in `start'
from /home/kyle/.rvm/gems/ruby-2.1.0/gems/railties-3.2.13/lib/rails/commands.rb:41:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'
I'm getting a "NoMethodError" on all Google pages. Any idea what's going on here?
Upvotes: 1
Views: 1013
Reputation: 1724
One problem that might cause the error is that you don't specify a user-agent, so Google may block your request.
For example, in Python's requests library the default user-agent is python-requests, which needs to be changed; otherwise Google might block the request because it looks like a bot.
Also, Zach Kemp pointed out that the URL is slightly wrong: #q= should be ?q=, because everything after # is a fragment that is never sent to the server.
require 'nokogiri'
require 'httparty'
headers = {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
q: "stackoverflow error",
hl: "en"
}
response = HTTParty.get('https://www.google.com/search',
query: params,
headers: headers)
doc = Nokogiri::HTML(response.body)
puts doc.at_xpath('//*[contains(@class, "yuRUbf")]/a/h3/text()')
# or at_css, which is faster for class names and produces better XPath
# (&. guards against nil when nothing matches)
puts doc.at_css(".yuRUbf > a > h3")&.text
---
#=> What is a StackOverflowError? - Stack Overflow
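The two points above (the #q= versus ?q= URL and the default user-agent) can be checked without sending any request. The following is a minimal sketch using only Ruby's standard library; the replacement user-agent string is an illustrative placeholder, not a recommended value.

```ruby
require 'uri'
require 'net/http'

# URL from the question: the search terms sit after '#', so they are a
# fragment. The fragment stays on the client and is not part of the request.
fragment_url = URI.parse('https://www.google.com/#q=stack+overflow+error')
fragment_query = fragment_url.query        # nil, no query string at all
fragment_part  = fragment_url.fragment     # the search terms ended up here
fragment_path  = fragment_url.request_uri  # what the server would actually see

# Corrected form: the terms go into the query string of /search.
search_url = URI.parse('https://www.google.com/search')
search_url.query = URI.encode_www_form(q: 'stack overflow error', hl: 'en')
search_path = search_url.request_uri

# Net::HTTP fills in a generic User-Agent unless one is set explicitly.
request = Net::HTTP::Get.new(search_url)
default_agent = request['User-Agent']
# Placeholder value for illustration only.
request['User-Agent'] = 'Mozilla/5.0 (illustrative placeholder)'

puts "fragment URL sends: #{fragment_path.inspect}, query: #{fragment_query.inspect}"
puts "search URL sends:   #{search_path.inspect}"
puts "default User-Agent: #{default_agent.inspect}"
```

With the fragment form, the server only ever sees a request for the home page, which is why the result elements are missing regardless of the XPath used.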
Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you only need to iterate over structured JSON rather than work out the selectors yourself. Check out the playground.
require 'google_search_results'
params = {
api_key: "YOUR_API_KEY",
engine: "google",
q: "stackoverflow error",
hl: "en"
}
search = GoogleSearch.new(params)
hash_results = search.get_hash
# [0] first element from organic results
puts hash_results[:organic_results][0][:title]
---
#=> StackOverflowError (Java Platform SE 7 ) - Oracle Help Center
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 5019
Google does not return the data you're looking for in the response. This element is fetched with JavaScript when the page is loaded by the browser. Nokogiri does not run any JavaScript on a page, so the element never exists in the document it parses.
Upvotes: 1