Reputation: 952
Saving the HTML of a web page in Ruby is easy. One way is to use the rio gem:
require 'rubygems'
require 'rio'
rio('http://www.google.com') > rio('google.html')
Would it be possible to do the same by parsing the HTML, requesting each of the images, JavaScript, and CSS files it references, and saving them individually?
I suspect that would not be very efficient.
So, is there a way to automatically save a web page together with all of the images, CSS, and JavaScript it depends on?
Upvotes: 2
Views: 2445
Reputation: 192
url = "docs.zillabyte.com"
output_dir = "/tmp/crawl"
# -E = adjust malformed extensions (e.g. /some_image/ -> /some_image.gif)
# -H = span hosts (e.g. include assets from other domains)
# -p = download all assets associated with the page
# -P = output prefix (a.k.a the directory to dump the assets)
system("wget -E -H -p '#{url}' -P '#{output_dir}'")
# read files from 'output_dir'
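As a follow-up to that last comment, here is a minimal sketch of reading the mirrored files back out of `output_dir` and grouping them by extension (the directory layout depends entirely on the site wget crawled, and the grouping is just one illustrative way to inspect the results):

```ruby
require 'fileutils'

output_dir = "/tmp/crawl"
FileUtils.mkdir_p(output_dir)  # ensure the directory exists for this sketch

# Collect every downloaded file (recursively) and group by extension
files  = Dir.glob(File.join(output_dir, "**", "*")).select { |f| File.file?(f) }
by_ext = files.group_by { |f| File.extname(f).downcase }

by_ext.each do |ext, list|
  puts "#{ext.empty? ? '(no ext)' : ext}: #{list.length} file(s)"
end
```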
Upvotes: 0
Reputation: 709
Most of the time we can rely on the system's tools: as dimus said, you can use wget to download the page.
Ruby's standard library also provides several APIs for network tasks, such as net/ftp, net/http, and net/https; see the Net::HTTP documentation for details. These only fetch the response, though, so you still have to parse the HTML document yourself to find the linked assets. Using one of Mozilla's libraries is another option.
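A minimal sketch of that parsing step, using a simple regex in place of a real HTML parser such as Nokogiri (the helper name `asset_urls` and the example markup are invented for illustration; each extracted URL would then be fetched with Net::HTTP and written to disk):

```ruby
require 'uri'

# Hypothetical helper: pull src/href attribute values out of raw HTML and
# resolve them against the page's URL. A crude regex stands in for a parser.
def asset_urls(html, base_url)
  base = URI(base_url)
  html.scan(/(?:src|href)=["']([^"']+)["']/).flatten.map do |ref|
    (base + ref).to_s  # URI#+ resolves relative references against the base
  end
end

html = <<~HTML
  <img src="/logo.png">
  <script src="app.js"></script>
  <link rel="stylesheet" href="http://cdn.example.com/site.css">
HTML

urls = asset_urls(html, "http://example.com/index.html")
# urls => ["http://example.com/logo.png",
#          "http://example.com/app.js",
#          "http://cdn.example.com/site.css"]
```

Absolute references are left untouched by `URI#+`, so assets served from other hosts (like the CDN stylesheet above) resolve correctly as well.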
Upvotes: 0