empty html with nokogiri

Question

I'm trying to parse this URL:

http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true

Here are the console results:

uri = "http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true"

n = Nokogiri::HTML(uri)
=> #<Nokogiri::HTML::Document ...>
irb(main):115:0> n.css("#contenido")
=> []
irb(main):119:0> n.css("title")
=> []

I'm getting an empty HTML document.

I don't have this problem with other web pages.

Where is the error?

dancow · Accepted Answer

Try this:

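   require 'nokogiri'
   require 'open-uri'
   # URI.open fetches the page; Kernel#open no longer accepts URLs as of Ruby 3.0
   n = Nokogiri::HTML(URI.open(uri))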

In your original call, you are parsing the URL itself as a string, but Nokogiri needs the contents of that URL, so you have to fetch and open the page first.

To elaborate on the comments, this is what your original call retrieves:

Nokogiri::HTML(uri)
=> #(Document:0x3fe9fdc3a2e0 {
  name = "document",
  children = [
    #(DTD:0x3fe9fdc3b4c4 { name = "html" }),
    #(Element:0x3fe9fdc40488 {
      name = "html",
      children = [
        #(Element:0x3fe9fdc45974 {
          name = "body",
          children = [
            #(Element:0x3fe9fdc475bc {
              name = "p",
              children = [
                #(Text "http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true")]
              })]
          })]
      })]
  })

Here's my version, with the open call:

Nokogiri::HTML(open(uri))
=> #(Document:0x3fe9fe012980 {
  name = "document",
  children = [
    #(DTD:0x3fe9fe0162b0 { name = "html" }),
    #(Element:0x3fe9fe0153ec {
      name = "html",
      children = [
        #(Element:0x3fe9fdc21470 {
          name = "body",
          children = [
            #(Element:0x3fe9fdc2087c {
              name = "header",
              children = [
                #(Element:0x3fe9fdc23838 {
                  name = "meta",
                  attributes = [
                    #(Attr:0x3fe9fdc22f50 {
                      name = "http-equiv",
                      value = "Refresh"
                      }),
                    #(Attr:0x3fe9fdc22f28 {
                      name = "content",
                      value = "0; URL=Session.timeout.php?log=0&referer=%2Fperso
                      })]
                  })]
              })]
          })]
      })]
  })

Technically, neither is the result you want, but for two different reasons. Your original call will never work as intended, no matter what page you're on. The example I've given you will work on pages that do not require authentication. For pages that require authentication and login, you'll want to use Mechanize to handle the form login transparently.

However, you really need to understand for yourself the difference between the code you posted and my fix, because that is absolutely crucial to moving forward.
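
One way to notice that a fetched page is only a redirect stub, like the session-timeout response above, is to inspect the `content` value of its meta refresh tag. The following is a rough sketch using only the standard library; `refresh_target` and `PAGE_URL` are hypothetical names, and the `content` value used here is a shortened stand-in, since the real one is truncated in the output above.

```ruby
require 'uri'

# Hypothetical helper: given the `content` value of a
# <meta http-equiv="Refresh"> tag and the URL of the page it came from,
# return the absolute redirect target, or nil if the value has no URL part.
def refresh_target(content_value, base_url)
  match = content_value.match(/\A\s*\d+\s*;\s*url\s*=\s*(.+?)\s*\z/i)
  return nil unless match

  # Resolve the (possibly relative) target against the page URL.
  URI.join(base_url, match[1]).to_s
end

PAGE_URL = 'http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true'

# Shortened, illustrative value; not the full attribute from the real page.
puts refresh_target('0; URL=Session.timeout.php?log=0', PAGE_URL)
```

If this returns a session or login URL, the page expects a logged-in session, and a plain fetch will not give you the content you are after.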
