Reputation: 10744
I'm trying to parse this URL:
http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true
Here are the console results:
uri = "http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true"
n = Nokogiri::HTML(uri)
=> #<Nokogiri::HTML::Document:0x65af7b6 name="document" children=[#<Nokogiri::XML::DTD:0x65af04a name="html">, #<Nokogiri::XML::Element:0x65adf56 name="html" children=[#<Nokogiri::XML::Element:0x64f98e4 name="body" children=[#<Nokogiri::XML::Element:0x64f96aa name="p" children=[#<Nokogiri::XML::Text:0x64f951a "http://abantia.cvtools.com/persona/WebLinkEntryPoint.php?idowner=36054&code=DetalleOferta&idofe=140544&no_links=true">]>]>]>]>
irb(main):115:0> n.css("#contenido")
=> []
irb(main):119:0> n.css("title")
=> []
I'm getting an empty HTML document:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
I don't have this problem with other web pages.
Where is the error?
Upvotes: 1
Views: 503
Reputation: 3388
Try this:
require 'nokogiri'
require 'open-uri'
# On Ruby < 3.0, plain open(uri) also works; newer versions need URI.open
n = Nokogiri::HTML(URI.open(uri))
In your original call, you are parsing the URL itself as a string, but Nokogiri needs the contents of that URL, so you have to fetch and open it first.
To elaborate on the comments, this is what your original call retrieves:
Nokogiri::HTML(uri)
=> #(Document:0x3fe9fdc3a2e0 {
name = "document",
children = [
#(DTD:0x3fe9fdc3b4c4 { name = "html" }),
#(Element:0x3fe9fdc40488 {
name = "html",
children = [
#(Element:0x3fe9fdc45974 {
name = "body",
children = [
#(Element:0x3fe9fdc475bc {
name = "p",
children = [
#(Text "http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true")]
})]
})]
})]
})
Here's my version, with the open call:
Nokogiri::HTML(open(uri))
=> #(Document:0x3fe9fe012980 {
name = "document",
children = [
#(DTD:0x3fe9fe0162b0 { name = "html" }),
#(Element:0x3fe9fe0153ec {
name = "html",
children = [
#(Element:0x3fe9fdc21470 {
name = "body",
children = [
#(Element:0x3fe9fdc2087c {
name = "header",
children = [
#(Element:0x3fe9fdc23838 {
name = "meta",
attributes = [
#(Attr:0x3fe9fdc22f50 {
name = "http-equiv",
value = "Refresh"
}),
#(Attr:0x3fe9fdc22f28 {
name = "content",
value = "0; URL=Session.timeout.php?log=0&referer=%2Fperso
})]
})]
})]
})]
})]
})
Technically, neither is the result you want, but for two different reasons. Your original call will never work as intended, no matter what page you're on, because it parses the URL string itself. The version I've given you will work on pages that do not require authentication. For pages that require authentication and login, as this one does, you want to use Mechanize to handle the form login transparently.
However, you really need to understand for yourself the difference between the code you posted and my fix, because that is absolutely crucial to moving forward.
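The Refresh meta tag in the second dump is the site sending you to its session-timeout page. As a rough way to notice this from a script, here is a minimal sketch that takes the tag's content value and resolves the redirect target against the page URL. The helper name refresh_target is my own invention, and the content value used in the example is a shortened, hypothetical one, since the real value is cut off in the dump above.

```ruby
require 'uri'

# Hypothetical helper: extract the redirect target from a meta-refresh
# "content" value such as "0; URL=Session.timeout.php?log=0" and resolve
# it against the URL of the page that contained the tag.
# Returns nil when the value carries no URL part.
def refresh_target(content, base_url)
  match = content.to_s.match(/\A\s*\d+\s*;\s*url\s*=\s*(.+)\z/i)
  return nil unless match

  # Drop optional quotes around the target before resolving it.
  target = match[1].strip.gsub(/\A['"]|['"]\z/, '')
  URI.join(base_url, target).to_s
end

page_url = 'http://abantia.cvtools.com/persona/Oferta.mostrar.php?idofe=140544&no_links=true'
puts refresh_target('0; URL=Session.timeout.php?log=0', page_url)
```

With the parsed document n, the content value could be read with meta = n.at_css('meta[http-equiv="Refresh"]') and then meta['content']. A target pointing at Session.timeout.php is a strong sign that you were served the timeout redirect rather than the real page.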
Upvotes: 1
Reputation: 33076
Your queries yield empty results because the page you are trying to access requires authentication. If you inspect the network traffic, you will see that you get an empty response. If you paste the URL into a browser, you will soon be redirected to an error page whose message hints at the missing authentication:
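Su sesión ha caducado ("Your session has expired")
Para seguir utilizando estas páginas debe volver a la página inicial y continuar normalmente ("To keep using these pages, you must return to the home page and continue normally")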
Unfortunately, there is no "standard" way of logging into websites. To log in automatically, use a library that submits the login form and keeps the session for you; in Ruby, the Mechanize gem plays the same role as the Mechanize library for Python.
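To make clearer what such a library automates, here is a minimal standard-library sketch (not Mechanize itself) of the underlying flow: post the login form, keep the session cookie from the response, and send it along with the request for the protected page. The method names, the login URL, and the form field names are all hypothetical; the real ones have to be read from the site's login form.

```ruby
require 'net/http'
require 'uri'

# Turn the Set-Cookie headers of a response into a Cookie request header,
# keeping only the name=value part of each cookie.
def cookie_header(set_cookie_lines)
  Array(set_cookie_lines).map { |line| line.split(';', 2).first.strip }.join('; ')
end

# Log in, then fetch a protected page using the session cookie.
# login_url and the keys of fields are placeholders (assumptions).
def fetch_with_session(login_url, target_url, fields)
  login_response = Net::HTTP.post_form(URI(login_url), fields)
  cookies = cookie_header(login_response.get_fields('set-cookie'))

  target = URI(target_url)
  request = Net::HTTP::Get.new(target)
  request['Cookie'] = cookies unless cookies.empty?
  Net::HTTP.start(target.host, target.port, use_ssl: target.scheme == 'https') do |http|
    http.request(request).body
  end
end
```

The returned body could then be handed to Nokogiri::HTML. A real login often needs more than this (hidden form fields, redirects, cookie scoping), which is the part a library like Mechanize takes care of.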
Upvotes: 2