jasonbogd

Reputation: 2501

How to backup entire web page (including images, etc.) in Ruby script?

If I have the URL of a web page, how can I download it locally, including all the images, stylesheets, etc.? Would I have to manually parse the HTML and figure out all the external resources, or is there a cleaner way?

Thanks!

Upvotes: 4

Views: 2595

Answers (3)

Michael Granger

Reputation: 1368

You can do this fairly easily (albeit not as easily as just learning to use 'wget') with Net::HTTP and Nokogiri:

require 'nokogiri'
require 'net/http'
require 'pathname'

# Set to the host and the path of the HTML file
host = 'rubygems.org'
path = '/'

# Fetch the page and parse it
source = Net::HTTP.get( host, path )
page   = Nokogiri::HTML( source )
dir    = Pathname( path ).dirname

# Download images
page.xpath( '//img[@src]' ).each do |imgtag|
    localpath = Pathname( imgtag[:src] ).relative_path_from( dir )
    localpath.dirname.mkpath  # create the parent directories, not the file itself
    localpath.open( 'wb' ) do |fh|
        fh.write( Net::HTTP.get(host, imgtag[:src]) )
    end
end

# Download stylesheets
page.xpath( '//link[@rel="stylesheet"]' ).each do |linktag|
    localpath = Pathname( linktag[:href] ).relative_path_from( dir )
    localpath.dirname.mkpath
    localpath.open( 'w' ) do |fh|
        fh.write( Net::HTTP.get(host, linktag[:href]) )
    end
end

You'd obviously need better error-checking, and the resource-fetching code needs to be pulled up into a method, but if you really want to do this from Ruby, it's certainly possible.
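Pulling the fetching into a method might look something like this (a minimal sketch; `local_path_for` and `save_resource` are hypothetical helper names, and real code would still need to handle absolute URLs, query strings, and HTTP errors):

```ruby
require 'net/http'
require 'pathname'

# Compute where a remote path should land relative to the page's directory.
def local_path_for(remote_path, dir)
  Pathname(remote_path).relative_path_from(dir)
end

# Hypothetical helper: fetch one resource from `host` and write it to the
# corresponding local path, creating intermediate directories as needed.
def save_resource(host, remote_path, dir)
  localpath = local_path_for(remote_path, dir)
  localpath.dirname.mkpath  # parent directories, not the file itself
  localpath.open('wb') { |fh| fh.write(Net::HTTP.get(host, remote_path)) }
  localpath
end
```

With that in place, both loops above collapse to a single `save_resource(host, tag[:src], dir)` call per matched tag.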

Upvotes: 4

the Tin Man

Reputation: 160631

This is one of those times I'd look elsewhere. Not that it can't be done in Ruby, but there are existing tools made for this that do it very well. Why reinvent the wheel?

Look at wget. It is the standard tool for retrieving web resources, including mirroring entire sites, and it is available on all platforms. From the docs:

Retrieve only one html page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.

wget -p --convert-links http://www.server.com/dir/page.html

The HTML page will be saved to www.server.com/dir/page.html, and the images, stylesheets, etc. somewhere under www.server.com/, depending on where they were on the remote server.

You could easily call wget from within a Ruby script using backticks or %x:

`/path/to/wget -p --convert-links http://www.server.com/dir/page.html`

or

%x{/path/to/wget -p --convert-links http://www.server.com/dir/page.html}

There are a lot of other mechanisms to do the same thing in Ruby, which give you more control.
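For instance, backticks silently discard the exit status, so a sketch with error checking might use Open3 instead (`wget_command` and `mirror_page` are hypothetical helper names, not part of any library):

```ruby
require 'open3'
require 'shellwords'

# Build the wget invocation as a safely-quoted shell string.
def wget_command(url)
  ['wget', '-p', '--convert-links', url].shelljoin
end

# Run wget and raise if it fails, instead of ignoring the exit status
# the way backticks or %x do.
def mirror_page(url)
  out, err, status = Open3.capture3(wget_command(url))
  raise "wget failed: #{err}" unless status.success?
  out
end
```

`shelljoin` keeps a URL containing spaces or shell metacharacters from being mangled, and `capture3` gives you stdout, stderr, and the exit status separately.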

Upvotes: 5

mikez

Reputation: 160

Well, if you're only doing this for a few pages, I don't think you need a script. You can simply save the web page from any web browser and it will download the necessary images, stylesheets, etc. In Chrome, you can also browse all the resources used on a single web page.

Upvotes: -2
