Gfranc

Reputation: 1

Way to extract HTML and download all attachments from a website

I would like to be able to run a script (or something) that will "download" a certain webpage (HTML) and all of its attachments (Word docs) so that I can keep and operate a private collection.

Here is the story: there is a site that I use a lot for research. It has many HTML pages that contain text and download links to documents (.pdf and .doc files). There is a threat that the owner (the US government) is going to 'privatize' the information, which I think is bogus, but the threat is there nonetheless. I would like to extract all the HTML text and copies of all the attachments so that I can host my own version of the data on my desktop for personal use (just in case). Is there a simple way to do this?

Note: I do not have FTP access to this webserver, only access to the individual webpages and attachments.

Upvotes: 0

Views: 3963

Answers (2)

Joseph Sheedy

Reputation: 6726

I use wget for this purpose.

wget --mirror --no-parent http://remotesite.gov/documents/

The key when mirroring a portion of a site is to make sure not to ascend outside of the directory you're interested in. That's what the --no-parent flag does.
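If you also want to keep only the page and its linked documents, wget can filter by file extension. Something along these lines (these are standard wget options, but I haven't run it against your site, so tune the extension list and URL):

wget --mirror --no-parent --convert-links --page-requisites --accept html,pdf,doc http://remotesite.gov/documents/

--convert-links rewrites the links so the local copy is browsable offline, and --accept limits what is kept to the listed extensions.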

Upvotes: 1

Valentin Flachsel

Reputation: 10825

There are a ton of programs out there that can do this; a Google search for "offline browser" will yield quite a few results. I wouldn't be too keen to reinvent the wheel, but for a self-built solution I would probably use the cURL library for PHP. Then again, it depends on which programming languages you're familiar with.
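If you do roll your own and would rather use Python than PHP's cURL, here is a minimal sketch using only the standard library. The URL, extension list, and output directory are placeholders, and it only grabs attachments linked directly from the one page; treat it as a starting point, not a finished mirror tool.

import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

START_PAGE = "http://remotesite.gov/documents/index.html"  # placeholder URL
EXTENSIONS = (".pdf", ".doc")                               # attachment types to keep
OUT_DIR = "mirror"                                          # local folder for the copies

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    """Return the raw bytes of a URL."""
    with urlopen(url) as resp:
        return resp.read()

def save(url, data, out_dir):
    """Write downloaded bytes to out_dir, named after the URL's last path segment."""
    name = os.path.basename(urlparse(url).path) or "index.html"
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, name), "wb") as fh:
        fh.write(data)

def mirror_page(page_url, out_dir):
    # Save the HTML page itself.
    html = fetch(page_url)
    save(page_url, html, out_dir)

    # Find every link on the page and download the ones that look like attachments.
    parser = LinkCollector()
    parser.feed(html.decode("utf-8", errors="replace"))
    for href in parser.links:
        absolute = urljoin(page_url, href)
        if absolute.lower().endswith(EXTENSIONS):
            save(absolute, fetch(absolute), out_dir)

if __name__ == "__main__":
    mirror_page(START_PAGE, OUT_DIR)

To cover a whole section of the site you would feed it the list of page URLs (or have it follow the HTML links recursively), but for a one-off grab of a handful of pages this is about all you need.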

Hope this helps.

Upvotes: 1
