massinissa

Reputation: 952

ruby + save web page

Saving the HTML of a web page with Ruby is very easy.

One way to do it is by using rio:

require 'rubygems'
require 'rio'
# rio's > operator copies the remote page into the local file
rio('http://www.google.com') > rio('google.html')

It would be possible to do the same by parsing the HTML, requesting each of the images, JavaScript and CSS files it references, and saving them one by one.

But I think that approach is not very efficient.

So, is there a way to automatically save a web page together with all the images, CSS and JavaScript related to that page?
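For reference, the manual approach I am thinking of would look roughly like this (just a sketch; nokogiri and the output file names are my own assumptions, not part of the rio example above):

require 'rubygems'
require 'nokogiri'   # assumed gem, used to parse the HTML
require 'open-uri'   # provides open() for URLs (URI.open on newer Rubies)
require 'uri'

page_url = 'http://www.google.com/'

# Download and save the page itself.
html = open(page_url) { |f| f.read }
File.open('page.html', 'w') { |f| f.write(html) }

# Find the images, scripts and stylesheets the page references.
doc = Nokogiri::HTML(html)
nodes = doc.css('img[src], script[src], link[rel="stylesheet"][href]')
asset_urls = nodes.map { |n| URI.join(page_url, n['src'] || n['href']).to_s }

# Request each asset again and save it next to the page.
asset_urls.uniq.each do |asset_url|
  name = File.basename(URI.parse(asset_url).path)
  next if name.empty? || name == '/'
  begin
    data = open(asset_url) { |f| f.read }
    File.open(name, 'wb') { |f| f.write(data) }
  rescue OpenURI::HTTPError, SocketError
    # skip assets that cannot be fetched
  end
end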

Upvotes: 2

Views: 2445

Answers (3)

jake256

Reputation: 192

url = "docs.zillabyte.com"
output_dir = "/tmp/crawl"

# -E = adjust malformed extensions (e.g. /some_image/ -> /some_image.gif)
# -H = span hosts (e.g. include assets from other domains) 
# -p = download all assets associated with the page
# -P = output prefix (a.k.a the directory to dump the assets)
system("wget -E -H -p '#{url}' -P '#{output_dir}'")

# read files from 'output_dir'
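Reading the dumped files back might look something like this (only a sketch, reusing the output_dir value from above):

# Walk output_dir and list everything wget downloaded.
output_dir = "/tmp/crawl"

Dir.glob(File.join(output_dir, '**', '*')).each do |path|
  next unless File.file?(path)
  puts "#{path} (#{File.size(path)} bytes)"
end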

Upvotes: 0

Qianjigui

Reputation: 709

Most of the time we can use the system's tools. Like dimus said, you can use wget to download the page.

There are also many useful standard-library APIs for network tasks, such as net/ftp, net/http and net/https; see the Net::HTTP documentation for details. But these methods only fetch the response, so the remaining work is parsing the HTML document yourself. Using Mozilla's libraries is another good option.
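For example, fetching just the raw HTML with net/http from the standard library could look roughly like this (a minimal sketch; the output file name is arbitrary):

require 'net/http'
require 'uri'

uri = URI.parse('http://www.google.com/')
response = Net::HTTP.get_response(uri)

# Net::HTTP only returns the response; saving and parsing are up to you.
File.open('google.html', 'w') { |f| f.write(response.body) }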

Upvotes: 0

dimus

Reputation: 9390

What about system("wget -r -l 1 http://google.com")?
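Here -r turns on recursive retrieval and -l 1 caps the recursion depth at one level, so the page and the resources it links to directly get fetched. Wrapped with a check on the exit status it might look like this (just a sketch):

# Run wget and report whether it succeeded.
ok = system('wget', '-r', '-l', '1', 'http://google.com')
puts 'wget failed' unless ok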

Upvotes: 2
