dayer4b

Reputation: 987

Using wget to mirror a website when a path and a subfolder have the same name

I'm trying to make a mirror of a website, but some of its URLs collide when wget maps them to files on disk in the normal way. The problem shows up with URL pairs like http://example.com/news and http://example.com/news/article1.

Wget saves these URLs as /news and /news/article1, which means the /news file gets overwritten by a directory of the same name.

A proper static mirror would require that these two URLs be downloaded instead as /news/index.html and /news/article1.
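The underlying filesystem constraint can be seen without wget at all. This is a minimal sketch (no network needed; the paths and file contents are placeholders):

```shell
#!/bin/sh
# Minimal demonstration of the name collision.
demo=$(mktemp -d)
cd "$demo"

# wget first saves http://example.com/news as a plain file named "news":
echo '<html>news listing</html>' > news

# ...and later needs a directory "news/" to hold /news/article1,
# but a regular file already occupies that name:
mkdir news 2>/dev/null || echo "cannot create directory 'news': file exists"
```

A file and a directory cannot share a name, so one of the two downloads always clobbers the other.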

I have tried to work around this by running wget twice and moving the files between runs, but that hasn't worked well. The /news page has links to /news/article1 that need to be converted. I'm using the -k option to convert links, but when wget runs twice, it doesn't convert links between files downloaded in separate runs.

Here's my command:

wget -p -r -l4 -k -d -nH http://example.com

Here's an example of the work around that I've tried:

# wget once at first level (gets /news path but not /news/*)
wget -p -r -l1 -k -nH http://example.com

# move /news file to temp path
mv news /tmp/news.html

# wget again to get everything else (notice the different level value)
wget -p -r -l4 -k -nH http://example.com

# move temp path back to /news/index.html
mv /tmp/news.html news/index.html

In the above example, the links on the /news page that are supposed to point to /news/article1 have not been converted.

Does anybody know how to work around this with wget? Is there a different tool that would work better?

Upvotes: 5

Views: 3853

Answers (2)

Sam

Reputation: 262

Let's say you'd like to mirror an entire website with wget, and keep the naming scheme of the original website. That is, don't rename files by adding an .html extension.

A problem occurs when a file and a directory have the same name: the file is overwritten when the directory is created.

How about this solution:

  1. Mirror the website using wget --mirror
  2. Then, as a second step, go back and download only the problematic files. So, if there was a wiki/ directory whose main page was saved as a plain file named wiki, download just that page again and save it as wiki/index.html.

The script below generates index.html files when there is a file/directory name overlap.

#!/bin/bash

# For each directory in the mirror, fetch $website/<dir> into <dir>/index.html
# unless an index.html is already present.
function processdir() {
    website="https://www.example.com"
    echo "The dir name is $1"
    if [ -e "$1/index.html" ]; then
        echo "$1/index.html already exists"
    else
        echo "Downloading $website/$1 to $1/index.html"
        rm -f /tmp/index.html
        if wget --quiet -O /tmp/index.html "$website/$1"; then
            echo "Download succeeded. Copying file into place."
            cp /tmp/index.html "$1/index.html"
        else
            echo "Download failed."
        fi
        ls -al "$1/index.html"
    fi
}
export -f processdir

# Run processdir on every directory under the current one.
find . -type d -exec /bin/bash -c 'processdir "$0"' {} \;

Upvotes: 0

dayer4b

Reputation: 987

I figured it out!

The problem was my assumption that /news/index.html was the filename I needed. After closely reading the man page, I found that -E (--adjust-extension) solved my problem. This flag makes wget append an .html extension to every HTML file it downloads.

Coupling that with -k to convert the links results in a 100% usable mirror that has all of the pages needed.

Here's an example map of the downloaded files and paths:

http://example.com/news           -->  /news.html
http://example.com/news/article1  -->  /news/article1.html

As a functional mirror, this is great. Default webserver configurations (at least for Apache) seem to allow the path http://sitemirror.com/news/article1 to load the /news/article1.html content. However, a rewrite rule may be needed to keep the http://sitemirror.com/news path from returning a 404 or a directory index. This should not be tough.
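For reference, a mod_rewrite sketch that maps extensionless requests onto the .html files produced by -E might look like the following (untested against a real mirror; the paths are placeholders):

# Hypothetical Apache configuration for serving the -E mirror.
#   /news          -> /news.html
#   /news/article1 -> /news/article1.html
RewriteEngine On
# If the requested path is not a regular file but a .html
# sibling exists on disk, serve that file instead.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ $1.html [L]

Note the first condition tests only !-f, not !-d, so a request for the /news directory itself also gets rewritten to /news.html rather than showing a folder index.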

Oh, so here's my final wget command:

wget -p -r -l4 -E -k -nH http://example.com

Upvotes: 4
