hyperrjas

Reputation: 10744

Empty HTML with Nokogiri

I'm trying to parse this URL:

http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true

Here are the console results:

uri = "http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true"

n = Nokogiri::HTML(uri)
=> #<Nokogiri::HTML::Document:0x65af7b6 name="document" children=[#<Nokogiri::XML::DTD:0x65af04a name="html">, #<Nokogiri::XML::Element:0x65adf56 name="html" children=[#<Nokogiri::XML::Element:0x64f98e4 name="body" children=[#<Nokogiri::XML::Element:0x64f96aa name="p" children=[#<Nokogiri::XML::Text:0x64f951a "http://abantia.cvtools.com/persona/WebLinkEntryPoint.php?idowner=36054&code=DetalleOferta&idofe=140544&no_links=true">]>]>]>]>
irb(main):115:0> n.css("#contenido")
=> []
irb(main):119:0> n.css("title")
=> []

I'm getting an empty HTML document:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

I don't have this problem with other web pages.

Where is the error?

Upvotes: 1

Views: 503

Answers (2)

dancow

Reputation: 3388

Try this:

   require 'nokogiri'
   require 'open-uri'
   # URI.open fetches the page; on Ruby older than 2.5, use open(uri) instead
   n = Nokogiri::HTML(URI.open(uri))

Your original call parses the URL string itself as if it were HTML. Nokogiri does not fetch anything; you need to open the URL and pass its contents to Nokogiri.

To elaborate on the comments, this is what your original call retrieves:

Nokogiri::HTML(uri)
=> #(Document:0x3fe9fdc3a2e0 {
  name = "document",
  children = [
    #(DTD:0x3fe9fdc3b4c4 { name = "html" }),
    #(Element:0x3fe9fdc40488 {
      name = "html",
      children = [
        #(Element:0x3fe9fdc45974 {
          name = "body",
          children = [
            #(Element:0x3fe9fdc475bc {
              name = "p",
              children = [
                #(Text "http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true")]
              })]
          })]
      })]
  })

Here's my version, with the open call:

Nokogiri::HTML(open(uri))
=> #(Document:0x3fe9fe012980 {
  name = "document",
  children = [
    #(DTD:0x3fe9fe0162b0 { name = "html" }),
    #(Element:0x3fe9fe0153ec {
      name = "html",
      children = [
        #(Element:0x3fe9fdc21470 {
          name = "body",
          children = [
            #(Element:0x3fe9fdc2087c {
              name = "header",
              children = [
                #(Element:0x3fe9fdc23838 {
                  name = "meta",
                  attributes = [
                    #(Attr:0x3fe9fdc22f50 {
                      name = "http-equiv",
                      value = "Refresh"
                      }),
                    #(Attr:0x3fe9fdc22f28 {
                      name = "content",
                      value = "0; URL=Session.timeout.php?log=0&referer=%2Fperso
                      })]
                  })]
              })]
          })]
      })]
  })

Technically, neither is the result you want, but for two different reasons. Your original call will never work as intended, no matter what page you're on. The example I've given you will work on pages that do not require authentication. For pages that require authentication and login, you want to use Mechanize to handle the form login transparently.

However, you really need to understand for yourself the difference between the code you posted and my fix, because that is absolutely crucial to moving forward.

Upvotes: 1

Stefano Sanfilippo

Reputation: 33076

Your queries return empty results because the page you are trying to access requires authentication. If you inspect the network traffic, you will see that you get an empty response. If you repeat the step by pasting the URL into a browser, you will soon be redirected to an error page whose message is a good hint about the missing authentication:

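Su sesión ha caducado (Your session has expired)

Para seguir utilizando estas páginas debe volver a la página inicial y continuar normalmente (To continue using these pages, you must return to the home page and continue normally)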

Unfortunately, there is no "standard" way of logging into websites. To log in automatically, look at the Ruby Mechanize gem, the counterpart of the great Mechanize Python library.

Upvotes: 2
