chickenman

Reputation: 798

Webscraping Nokogiri unable to pick any classes

I am using this page: https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.2942j0j7&sourceid=chrome&ie=UTF-8

I am trying to get this element: class="_XWk"

  page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')

  parse_page = Nokogiri::HTML(page)
  parse_page.css('_XWk')

Here I can see the whole page in parse_page, but when I try .css('classname') I don't see anything. Am I using the method the wrong way?

Upvotes: 1

Views: 84

Answers (3)

Dmitriy Zub

Reputation: 1724

Check out the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in the browser.

The empty result is caused by a simple typo: the . (dot) before the class name is missing, as ran already mentioned.

In addition, another problem may occur because no HTTP user-agent is specified. Google will eventually block the request, and you'll receive completely different HTML containing an error message or something similar, without the data you were looking for. Check what your user-agent is.

Pass a user-agent:

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

HTTParty.get("https://www.google.com/search", headers: headers)

Iterate over the container to extract titles from Google Search:

data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
end

Code and example in the online IDE:

require 'httparty'
require 'nokogiri'

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "ford fusion msrp",
  num: "20"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

# Each .tF2Cxc container holds one organic result.
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text
  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end

-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford

Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''

Alternatively, you can achieve this by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't need to figure out what the correct selector is or why the results differ in the output, since that's already done for the end user.

The only thing left to do is iterate over the structured JSON and get the data you were looking for.

Example code:

require 'google_search_results' 

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "ford fusion msrp",
  hl: "en",
  num: "20"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]
  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end

-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford

Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''

P.S. I wrote a blog post about how to scrape Google Organic Search Results.

Disclaimer: I work for SerpApi.

Upvotes: 1

ran

Reputation: 11

Change parse_page.css('_XWk') to parse_page.css('._XWk')

Note the dot (.) difference. The dot references a class.

With parse_page.css('_XWk'), Nokogiri interprets _XWk as an element (tag) name rather than a class, so nothing matches.

Upvotes: 0

s1mpl3

Reputation: 1464

It looks like something is swapping the classes, so what you see in the browser is not what you get from the HTTP call. In this case, _XWk becomes _tA:

  page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
  parse_page = Nokogiri::HTML(page)
  parse_page.css('._tA').map(&:text) 

# >>["Up to 23 city / 34 highway", "From $22,610", "175 to 325 hp", "192″ L x 73″ W x 58″ H", "3,431 to 3,681 lbs"]

Upvotes: 0
