In need of an explanation of Web scraping with Nokogiri in Rails

Question

I am utterly confused and lost with Nokogiri and web scraping in Rails. I need someone to explain how I can get article titles from a website and list them in a view in my Rails application. I can retrieve the data in irb, but I have no clue how to display that same data in a view I made.

I have watched a number of tutorials and read the documentation. The thing that confuses me most is this: when they require nokogiri or open-uri in their example Ruby file, what directory is that file supposed to be placed in? Also, does that file need to be associated with a controller for its data to be displayed in the view that I made?

I hope I am explaining my issue as clearly as possible, as I am not trying to confuse myself any more than I already am.

What I am trying to do is create an application where the user can register and sign in. After signing in, they are redirected to a page with 3 links: Audi, BMW and Mercedes-Benz. Depending on which link is clicked, the user is directed to another page that lists articles mentioning their chosen brand.

I hope this explanation was helpful and I really hope someone can offer to help or give me some kind of documentation that will benefit me.

Thank you!

This is what I did in irb:

2.1.1 :001 > require 'rubygems'
 => false 
2.1.1 :002 > require 'nokogiri'
 => true 
2.1.1 :003 > require 'open-uri'
 => true 
2.1.1 :004 > page = Nokogiri::HTML(open("http://www.dtm.com/de/News/Archiv/index.html"))

I then got this returned:

=> #, #, #, #]>, #, #]>, #, #



(I got more but just put up a few lines of what was returned) I am assuming this is the raw data from the page. 

I then put:

2.1.1 :008 > puts page


Which returned the raw HTML content.

Finally I entered:

2.1.1 :014 > page.css("a")


Which returned all the links on the page.

Spasm · Accepted Answer

I am hoping to help you with a real-world example. Let's get some data from Reuters, for example.

In your console try this:

    # require your tools; make sure you have run gem install nokogiri
    pry(main)> require 'nokogiri'
    pry(main)> require 'open-uri'

    # set the url
    pry(main)> url = "http://www.reuters.com/finance/stocks/overview?symbol=0005.HK"

    # load and parse the page, then assign it to a variable
    # (on Ruby 3.0+, call URI.open(url) instead of open(url))
    pry(main)> doc = Nokogiri::HTML(open(url))

    # select the elements with the CSS class .sectionQuote (you can also select by id)
    pry(main)> quote = doc.css(".sectionQuote")

Now if you look at quote, you will see that it holds Nokogiri elements. Let's have a look inside:

    pry(main)> quote.size
    => 6

    pry(main)> quote.first
    => #(Element:0x43ff468 {
    name = "div",
    attributes = [ #(Attr:0x43ff404 { name = "class", value = "sectionQuote   nasdaqChange" })],
    children = [
      #(Text "
			"),
      #(Element:0x43fef18 {
        name = "div",
        attributes = [ #(Attr:0x43feeb4 { name = "class", value = "sectionQuoteDetail" })],
        children = [
          #(Text "
				"),
          #(Element:0x43fe9c8 { name = "span", attributes = [ #(Attr:0x43fe964 { name = "class", value = "nasdaqChangeHeader" })], children = [ #(Text "0005.HK on Hong Kong Stock")] }),
    .....
      }),
      #(Text "
    		")]

    })

You can see that Nokogiri has essentially encapsulated each DOM element so that you can search and access it quickly.

If you simply want to display this div element, you can:

    pry(main)> quote.first.to_html
    => "
        0005.HK on Hong Kong Stock

        82.85HKD

        14 Aug 2014
    "

It is also possible to use this output directly in the view of a Rails application.

If you want to be more specific and work with individual components, you can loop over the quote variable and look at each element it contains. In this instance you can:

    pry(main)> quote.each { |p| puts p.inspect }

Or be very specific and get the value of a single element, i.e. the name of the stock in our example:

    pry(main)> quote.at_css(".nasdaqChangeHeader").content
    => "0005.HK on Hong Kong Stock"

This is a very useful link: http://nokogiri.org/tutorials/searching_a_xml_html_document.html

Really hope this helps

PS: A tip for looking inside objects is to use inspect (http://ruby-doc.org/core-2.1.1/Object.html#method-i-inspect):

    puts quote.inspect
