Reputation: 478
I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.
This is almost perfect for my needs (...but not quite):
wget --mirror -nH -np -p -k -E -e robots=off http://mysite
--mirror: Recursively download the entire site
-p: Download all necessary page requisites
-k: Convert the URLs to relative paths so I can host them anywhere
Some things are being downloaded more than once, which results in myfile.html and myfile.1.html. This wouldn't be bad, except that when wget rewrites the hyperlinks, it points them at the myfile.1.html version, which changes the URLs and therefore has SEO implications (Google will index ugly-looking URLs).
The -nc option would prevent this, but as of wget-v1.13, I cannot use -k and -nc at the same time. Details for this are here.
I was hoping to use wget, but I am now considering looking into using another tool, like httrack, but I don't have any experience with that yet.
Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!
Upvotes: 5
Views: 4660
Reputation: 21
According to this (and a quick experiment of my own), you should have no problem using the -nc and -k options together to gather the pages you are after.
What does cause an issue is using -N with -nc: the two are incompatible and will not work together at all, so you cannot compare files by timestamp and still no-clobber them. The --mirror option includes -N implicitly, which is where the conflict comes from.
Rather than using --mirror, try replacing it with "-r -l inf", which enables recursive downloading to an infinite depth but still allows your other options to work.
An example, based on your original:
wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite
Notes: I would suggest adding -w 5 --random-wait --limit-rate=200k to avoid DoSing the server and to be a little less rude, but that is obviously up to you.
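For illustration, the politeness options from the note can be folded into the example command. This is only a sketch: the function name mirror_site is made up here, and the URL is passed in as an argument rather than hard-coded.

```shell
# Hypothetical wrapper (the name mirror_site is an assumption, not part of wget).
# It combines the example command with the suggested rate-limiting options.
mirror_site() {
  # $1: root URL of the site to mirror
  wget -r -l inf -k -nc -nH -p -E -e robots=off \
       -w 5 --random-wait --limit-rate=200k \
       "$1"
}
```

It would be called as: mirror_site http://yoursite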
Generally speaking I try to avoid using option groupings like --mirror because of conflicts like this being harder to trace.
I know this is an answer to a very old question but I think it should be addressed - wget is a new command for me but so far proving to be invaluable and I would hope others would feel the same.
Upvotes: 1
Reputation: 478
httrack got me most of the way; the only URL mangling it did was to make the links point to /folder/index.html instead of /folder/.
Neither httrack nor wget seemed to produce a perfect URL structure, so we ended up writing a little bash script that runs the crawler and then uses sed to clean up some of the URLs (crop the index.html from links, replace bla.1.html with bla.html, etc.)
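The original script is not shown, so the following is only a rough sketch of what such a sed cleanup step might look like. The function name clean_links and both patterns are assumptions; they only touch double-quoted href attributes, and they assume file names without newlines.

```shell
# Hypothetical cleanup step (clean_links and its patterns are assumptions,
# not the script described above). Run it on the directory the crawler wrote to.
clean_links() {
  # $1: directory containing the mirrored site
  find "$1" -type f -name '*.html' | while IFS= read -r file; do
    # 1st expression: href=".../index.html" -> href=".../"
    # 2nd expression: href="bla.1.html"     -> href="bla.html"
    sed -E \
      -e 's#(href="[^"]*/)index\.html"#\1"#g' \
      -e 's#(href="[^"]*)\.[0-9]+\.html"#\1.html"#g' \
      "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  done
}
```

Note that this rewrites the links only; duplicate files such as bla.1.html would still have to be removed or renamed separately, and a legitimate page whose name really ends in .1.html would be rewritten too.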
Upvotes: 2