Reputation: 1
I would like to be able to run a script (or something) that will "download" a certain webpage (HTML) and all of its attachments (Word docs) so that I can keep and maintain a private collection.
Here is the story: there is a site that I use a lot for research. It has many HTML pages that contain text and download links to documents (.pdf and .doc files). There is a threat that the owner (the US government) is going to 'privatize' the information, which I think is bogus, but the threat is there nonetheless. I would like to extract all of the HTML text and copies of all the attachments so that I can host my own version of the data on my desktop for personal use (just in case). Is there a simple way to do this?
Note: I do not have FTP access to this web server, only access to the individual web pages and attachments.
Upvotes: 0
Views: 3963
Reputation: 6726
I use wget for this purpose.
wget --mirror --no-parent http://remotesite.gov/documents/
The key when mirroring a portion of a site is to make sure not to ascend outside of the directory you're interested in. That's what the --no-parent flag does.
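If the .pdf and .doc attachments are linked from pages under that same path, the mirror should pick them up automatically. As a rough sketch (remotesite.gov/documents/ is a placeholder for the real URL), a few extra flags can make the local copy nicer to browse offline:

# Mirror the section, also grab page requisites (images, CSS), and rewrite
# links so the saved copy works when opened locally. --wait=1 just adds a
# one-second pause between requests to be polite to the server.
wget --mirror --no-parent --page-requisites --convert-links \
     --adjust-extension --wait=1 \
     http://remotesite.gov/documents/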
Upvotes: 1
Reputation: 10825
There are a ton of programs out there that can do this; a Google search for "offline browser" will turn up plenty of results. I wouldn't be too keen to reinvent the wheel, but for a self-built solution I would probably use the cURL library for PHP. Then again, it depends on which programming languages you're familiar with.
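Just to illustrate the "roll your own" idea, here is a minimal sketch using the curl command-line tool plus grep instead of the PHP cURL library. The URL is a placeholder, and it assumes the document links are plain absolute href attributes on the page:

# Save the index page, pull out the .pdf/.doc links, then fetch each one.
# Relative links would need the base URL prepended before downloading.
curl -s http://remotesite.gov/documents/index.html -o index.html
grep -oE 'href="[^"]*\.(pdf|doc)"' index.html \
  | sed 's/^href="//; s/"$//' \
  | while read -r url; do
      curl -s -O "$url"    # -O saves each file under its remote name
    done

A dedicated offline browser (or the wget answer above) will handle relative links, recursion, and retries for you, so a hand-rolled script like this is mainly a starting point if you need custom behaviour.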
Hope this helps.
Upvotes: 1