Reputation: 987
I'm trying to make a mirror of a website, but the URLs include several paths that overlap when copied to files on disk in the normal wget way. The problem manifests with URLs like http://example.com/news and http://example.com/news/article1. Wget downloads these URLs as /news and /news/article1, but that means that the /news file is overwritten by a folder with the same name. A proper static mirror would require that these two URLs be downloaded instead as /news/index.html and /news/article1.
I have tried to work around this problem by running wget twice and moving the files accordingly, but that hasn't worked well for me. The /news path has links to /news/article1 that need to be converted. I'm using the -k option to convert links, but if I run wget twice, it doesn't convert the links between files downloaded in separate runs.
Here's my command:
wget -p -r -l4 -k -d -nH http://example.com
Here's an example of the workaround that I've tried:
# wget once at first level (gets /news path but not /news/*)
wget -p -r -l1 -k -nH http://example.com
# move /news file to temp path
mv news /tmp/news.html
# wget again to get everything else (notice the different level value)
wget -p -r -l4 -k -nH http://example.com
# move temp path back to /news/index.html
mv /tmp/news.html news/index.html
In the above example, the links on the /news page that are supposed to point to /news/article1 have not been converted.
Does anybody know how to work around this with wget? Is there a different tool that would work better?
Upvotes: 5
Views: 3853
Reputation: 262
Let's say you'd like to mirror an entire website with wget, and keep the naming scheme of the original website. That is, don't rename files by adding an .html extension.
A problem occurs when a file and a directory have the same name: the file is overwritten when the directory is created.
How about this solution: first, mirror the site while keeping the original names:
wget --mirror https://www.example.com
The script below generates index.html files when there is a file/directory name overlap.
#!/bin/bash
# For each directory in the mirror, fetch the corresponding URL and
# save it as index.html if that file doesn't already exist.
function processdir() {
    website="https://www.example.com"
    echo "The dir name is $1"
    if [ -e "$1/index.html" ]; then
        echo "$1/index.html already exists"
    else
        echo "Downloading $website/$1 to $1/index.html"
        rm -f /tmp/index.html
        if wget --quiet -O /tmp/index.html "$website/$1"; then
            echo "Download succeeded. Copying file into place."
            cp /tmp/index.html "$1/index.html"
        else
            echo "Download failed."
        fi
        ls -al "$1/index.html"
    fi
}
export -f processdir
find . -type d -exec /bin/bash -c 'processdir "$0"' {} \;
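To use it, you'd run the script from the root of the downloaded mirror. A hypothetical invocation, assuming the mirror was saved into a www.example.com directory (wget's default without -nH) and the script was saved as fix-indexes.sh:
# run the index-generating script from inside the mirror root
cd www.example.com
bash ../fix-indexes.sh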
Upvotes: 0
Reputation: 987
I figured it out!
The problem was my assumption that /news/index.html was the URL that I needed. After closely reading the man page, I found that -E (--adjust-extension) solved my problem. This flag forces wget to append the .html extension to every HTML file that it downloads. Coupling that with -k to convert the links results in a 100% usable mirror that has all of the pages needed.
Here's an example map of the downloaded files and paths:
http://example.com/news --> /news.html
http://example.com/news/article1 --> /news/article1.html
As a functional mirror, this is great. Default webserver configurations (at least for Apache) seem to allow the path http://sitemirror.com/news/article1 to load the /news/article1.html content. However, a rewrite may be necessary to keep the http://sitemirror.com/news path from displaying a 404 or an index of the folder. This should not be tough.
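For example, here's a minimal mod_rewrite sketch of such a rule (hypothetical, assuming Apache with mod_rewrite enabled; untested against this exact layout):
RewriteEngine On
# If the request maps to a directory on disk...
RewriteCond %{REQUEST_FILENAME} -d
# ...and a sibling .html file exists (e.g. /news -> /news.html)...
RewriteCond %{REQUEST_FILENAME}.html -f
# ...serve the .html file instead of the directory.
RewriteRule ^(.+?)/?$ $1.html [L]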
Oh, so here's my final wget command:
wget -p -r -l4 -E -k -nH http://example.com
Upvotes: 4