How to get a full URL given a shortened one passed to Nokogiri?

I want to traverse some HTML documents with Nokogiri. After getting the parsed document object, I want the last URL that was used to fetch the document to be part of my JSON response.

url = "http://ow.ly/hh8ri"
doc = Nokogiri::HTML(open(url))
...

Nokogiri internally redirects it to http://www.mp.rs.gov.br/imprensa/noticias/id30979.html, and I want to have access to that final URL.

I want to know if the "doc" object exposes that URL as an attribute or something similar. Does someone know a workaround?

By the way, I want the full URL because I'm traversing the HTML to find <img> tags, and some have relative URLs like "/media/image/image.png", which I then make absolute using:

URI.join(url, relative_link_url).to_s

The image URL should be:

http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg

Instead of:

http://ow.ly/media/imprensa/2013/01/30979_260_260__trytr.jpg

EDIT: IDEA

class Scraper < Nokogiri::HTML::Document
  attr_accessor :url

  class << self
    def new(url)
      html = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
      self.parse(html).tap do |d|
        url = URI.parse(url)
        response = Net::HTTP.new(url.host, url.port)
        head = response.start do |r|
          r.head url.path
        end
        d.url = head['location']
      end
    end
  end
end

Upvotes: 2

Answers (3)

the Tin Man

Reputation: 160553

Because your example is using OpenURI, that's the code to ask, not Nokogiri. Nokogiri has NO idea where the content came from.

OpenURI can tell you easily:

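require 'open-uri'

starting_url = 'http://www.example.com'
final_uri = nil
doc = nil

puts "Starting URL: #{starting_url}"

# On Ruby 3.0+, call URI.open; Kernel#open no longer accepts URLs.
URI.open(starting_url) do |io|
  final_uri = io.base_uri # URL the content was finally served from
  doc = io.read           # read the body while the handle is open
end

puts "Final URL: #{final_uri}"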

Which outputs:

Starting URL: http://www.example.com
Final URL: http://www.iana.org/domains/example

base_uri is documented in the OpenURI::Meta module.
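
To see base_uri change across a redirect without depending on an external site, here is a rough sketch that starts a throwaway local server with Ruby's socket library. The /short and /final paths and the tiny HTML body are made up for the demonstration; only URI.open and base_uri come from OpenURI.

```ruby
require 'open-uri'
require 'socket'

# Throwaway local server (hypothetical paths): /short redirects to /final.
server = TCPServer.new('127.0.0.1', 0)
port   = server.addr[1]
body   = '<img src="/media/image/image.png">'

server_thread = Thread.new do
  2.times do # one connection for /short, one for /final
    client = server.accept
    request_line = client.gets.to_s
    # Skip the remaining request headers up to the blank line.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    if request_line.start_with?('GET /short')
      client.write "HTTP/1.1 302 Found\r\n" \
                   "Location: http://127.0.0.1:#{port}/final\r\n" \
                   "Content-Length: 0\r\nConnection: close\r\n\r\n"
    else
      client.write "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n" \
                   "Content-Length: #{body.bytesize}\r\n" \
                   "Connection: close\r\n\r\n#{body}"
    end
    client.close
  end
end

final_uri = nil
html = nil

# OpenURI follows the redirect; base_uri reports where the content came from.
URI.open("http://127.0.0.1:#{port}/short") do |io|
  final_uri = io.base_uri
  html = io.read
end

server_thread.join
server.close

puts "Final URL: #{final_uri}"
# Relative links from the page resolve against the final URL, not the one requested.
puts URI.join(final_uri.to_s, '/media/image/image.png')
```

The html string captured here is what you would hand to Nokogiri::HTML, while final_uri is the base to use for any relative links found in it.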

Upvotes: 2

pguardiario

Reputation: 54984

Use Mechanize. The URLs will always be converted to absolute:

require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://ow.ly/hh8ri'
page.images.map{|i| i.url.to_s}
#=> ["http://www.mp.rs.gov.br/images/imprensa/barra_area.gif", "http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg"]

Upvotes: 3

Chris Salzberg

Reputation: 27374

I had the exact same issue recently. What I did was create a class that inherits from Nokogiri::HTML::Document, override the new class method to parse the document, and then save the URL in an instance variable with an accessor:

require 'nokogiri'
require 'open-uri'

class Webpage < Nokogiri::HTML::Document
  attr_accessor :url

  class << self

    def new(url)
      html = open(url)
      self.parse(html).tap do |d|
        d.url = url
      end
    end
  end
end

Then you can just create a new Webpage, and it will have access to all the normal methods you would have with a Nokogiri::HTML::Document:

w = Webpage.new("http://www.google.com")
w.url
#=> "http://www.google.com"
w.at_css('title')
#=> #<Nokogiri::XML::Element:0x4952f78 name="title" children=[#<Nokogiri::XML::Text:0x4952cb2 "Google">]>

If you have a relative URL that you got from an image tag, you can then make it absolute by passing the return value of the url accessor to URI.join:

relative_link_url = "/media/image/image.png"
#=> "/media/image/image.png"
URI.join(w.url, relative_link_url).to_s
#=> "http://www.google.com/media/image/image.png"
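
One caveat worth illustrating: URI.join resolves against whatever base it is given, so a root-relative path joined to a shortened URL lands on the shortener's host. A minimal sketch using only the URLs from the question (the variable names are illustrative):

```ruby
require 'uri'

# URLs taken from the question; variable names are illustrative.
short_url = 'http://ow.ly/hh8ri'
final_url = 'http://www.mp.rs.gov.br/imprensa/noticias/id30979.html'
relative  = '/media/imprensa/2013/01/30979_260_260__trytr.jpg'

# A root-relative path replaces the whole path of the base URL,
# so the host of the base decides where the result points.
wrong = URI.join(short_url, relative).to_s
right = URI.join(final_url, relative).to_s

puts wrong # => http://ow.ly/media/imprensa/2013/01/30979_260_260__trytr.jpg
puts right # => http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg
```

So the value stored in the url accessor should be the post-redirect URL rather than the one originally requested if it is going to serve as the base for relative links.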

Hope that helps.

p.s. the title of this question is quite misleading. Something more along the lines of "Accessing URL of Nokogiri HTML document" would be clearer.

Upvotes: 1
