skrippyfingaz

Reputation: 323

In need of an explanation of Web scraping with Nokogiri in Rails

I am utterly confused and lost with Nokogiri and web scraping in Rails. I need someone to explain how I can get article titles from a website and list them in a view in my Rails application. I can retrieve the data in irb, but I have no clue how to display that same data in a view I made.

I have watched a number of tutorials and read the documentation. The thing that confuses me most is that when they require nokogiri or open-uri in their example Ruby file, I can't tell which directory that file is supposed to be placed in. Also, does that file need to be associated with a controller for its data to be displayed in the particular view that I made?

I hope I am explaining my issue as clearly as possible, as I am not trying to confuse myself any more than I already am.

What I am trying to do is create an application where the user can register and sign in. After signing in, they are redirected to a page with 3 links: Audi, BMW and Mercedes-Benz. Depending on which link is clicked, the user is directed to another page that lists articles mentioning their chosen brand.

I hope this explanation was helpful and I really hope someone can offer to help or give me some kind of documentation that will benefit me.

Thank you!

This is what I did in irb:

2.1.1 :001 > require 'rubygems'
 => false 
2.1.1 :002 > require 'nokogiri'
 => true 
2.1.1 :003 > require 'open-uri'
 => true 
2.1.1 :004 > page = Nokogiri::HTML(open("http://www.dtm.com/de/News/Archiv/index.html")) 

I then got this returned:

=> #<Nokogiri::HTML::Document:0x814e3b40 name="document" children=[#<Nokogiri::XML::DTD:0x814e37f8 name="HTML">, #<Nokogiri::XML::Element:0x814e358c name="html" children=[#<Nokogiri::XML::Text:0x814e3384 "\r\n">, #<Nokogiri::XML::Element:0x814e32d0 name="head" children=[#<Nokogiri::XML::Text:0x814e30f0 "\r\n">, #<Nokogiri::XML::Element:0x814e3028 name="title" children=[#<Nokogiri::XML::Text:0x814e2e48 "DTM | Newsarchiv">]>, #<Nokogiri::XML::Text:0x814e2c90 "\r\n">, #<Nokogiri::XML::Element:0x814e2bc8 name="meta" attributes=[#<Nokogiri::XML::Attr:0x814e2b64 name="charset" value="utf-8">]>, #<Nokogiri::XML::Text:0x814e2718 "\r\n">, #<Nokogiri::XML::Element:0x814e2664 name="meta" ...

(I got more, but I have only included the first few lines of what was returned.) I am assuming this is the raw data from the page.

I then put:

2.1.1 :008 > puts page

Which returned the raw HTML content.

Finally I entered:

2.1.1 :014 > page.css("a")

Which returned all the links on the page.

Upvotes: 0

Views: 717

Answers (2)

Spasm

Reputation: 805

I am hoping to help you with a real-world example. Let's get some data from Reuters.

In your console try this:

    # require your tools (make sure you have run `gem install nokogiri`)
    pry(main)> require 'nokogiri'
    pry(main)> require 'open-uri'

    # set the url
    pry(main)> url = "http://www.reuters.com/finance/stocks/overview?symbol=0005.HK"

    # load the page and assign the parsed document to a variable
    # (on Ruby 3.0 and later, call URI.open(url) instead of open(url))
    pry(main)> doc = Nokogiri::HTML(open(url))

    # select the parts of the page with the CSS class .sectionQuote (you can also select by id)
    pry(main)> quote = doc.css(".sectionQuote")

Now if you look in quote, you will see that it holds Nokogiri elements. Let's have a look inside:

    pry(main)> quote.size
    => 6

    pry(main)> quote.first
    => #(Element:0x43ff468 {
    name = "div",
    attributes = [ #(Attr:0x43ff404 { name = "class", value = "sectionQuote   nasdaqChange" })],
    children = [
      #(Text "\n\t\t\t"),
      #(Element:0x43fef18 {
        name = "div",
        attributes = [ #(Attr:0x43feeb4 { name = "class", value = "sectionQuoteDetail" })],
        children = [
          #(Text "\n\t\t\t\t"),
          #(Element:0x43fe9c8 { name = "span", attributes = [ #(Attr:0x43fe964 { name = "class", value = "nasdaqChangeHeader" })], children = [ #(Text "0005.HK on Hong Kong Stock")] }),
    .....
      }),
      #(Text "\n\t\t")]
    })
You can see that Nokogiri has essentially encapsulated each DOM element, so that you can search and access it quickly.

If you simply want to display this div element, you can convert it back to HTML:

    pry(main)> quote.first.to_html
    => "<div class=\"sectionQuote nasdaqChange\">\n\t\t\t<div class=\"sectionQuoteDetail\">\n\t\t\t\t<span class=\"nasdaqChangeHeader\">0005.HK on Hong Kong Stock</span>\n\t\t\t\t<br class=\"clear\"><br class=\"clear\">\n\t\t\t\t<span style=\"font-size: 23px;\">\n\t\t\t\t82.85</span><span>HKD</span><br>\n\t\t\t\t<span class=\"nasdaqChangeTime\">14 Aug 2014</span>\n\t\t\t</div>\n\t\t</div>"

The resulting HTML string can be used directly in the view of a Rails application.
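One thing to watch when putting that string into a view: Rails escapes strings output with `<%= %>` by default, so the markup would show up as literal text unless it is marked safe (for example with `raw` or `html_safe`, and only for HTML you trust). The sketch below imitates the two outcomes with Ruby's standard `erb` library rather than a real Rails view; the `fragment` string is just a shortened copy of the output above.

```ruby
require 'erb'

# Shortened sample of what to_html returned above.
fragment = '<span class="nasdaqChangeHeader">0005.HK on Hong Kong Stock</span>'

# Escaped: roughly what <%= fragment %> produces in a Rails view by default.
escaped = ERB::Util.html_escape(fragment)

# Unescaped: plain ERB inserts the string as-is, similar to raw(fragment) in Rails.
template = ERB.new("<div><%= fragment %></div>")
unescaped = template.result(binding)

puts escaped
puts unescaped
```

In many cases it is simpler to skip `to_html` altogether, pull out only the text you need, and let the view escape it as usual.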

If you want to be more specific, you can loop over the quote variable and look at each matched element in turn:

    pry(main)> quote.each { |p| puts p.inspect }

Or you can be very specific and get the value of a single element, i.e. the name of the stock in our example:

    pry(main)> quote.at_css(".nasdaqChangeHeader").content
    => "0005.HK on Hong Kong Stock"

This is a very useful link: http://nokogiri.org/tutorials/searching_a_xml_html_document.html

Really hope this helps

PS: A tip for looking inside objects is inspect (http://ruby-doc.org/core-2.1.1/Object.html#method-i-inspect):

    puts quote.inspect

Upvotes: 0

sergiotp

Reputation: 21

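First, add nokogiri to the Gemfile of your Rails app. With that in place Bundler loads it for you, so you don't need to require it yourself. open-uri is part of Ruby's standard library rather than a gem, so it still needs a `require 'open-uri'` in the file where you use it.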

Your flow to scrape the sites should be:

    # put this code in your controller action
    # (open-uri supplies open for URLs; on Ruby 3.0 and later call URI.open instead)
    web_site = params[:web_site] # could be http://www.bmw.com/com/en/
    @doc = Nokogiri::HTML(open(web_site))

    <%# then you can iterate over the document in your view %>
    <% @doc.css('.standardTeaser').each do |teaser_bmw| %>
      <p><%= teaser_bmw.css('.headline').text %></p>
      <%# other content of the teaser can be searched for here %>
    <% end %>

So, to scrape the website you need to fetch its HTML and find the content you want to grab. If you know the basics of CSS selectors, this is easy to do. My example doesn't take into account whether you want to save the data in a database, but if you do, you just need to create a table with the fields you want to save and then create a record after parsing the HTML.

Does that make sense to you?

Upvotes: 0
