dayer4b

Reputation: 987

Using wget to mirror a website when a path and a subfolder have the same name

I'm trying to make a mirror of a website, but some of its URLs collide when wget maps them to files on disk in the normal way. The problem shows up with URL pairs like http://example.com/news and http://example.com/news/article1.

Wget saves these URLs as /news and /news/article1, which means the /news file gets overwritten by a directory of the same name.

A proper static mirror would require that these two URLs be downloaded instead as /news/index.html and /news/article1.
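The underlying filesystem constraint can be seen without wget at all. This is a minimal sketch (no network needed; the paths and file contents are placeholders):

```shell
#!/bin/sh
# Minimal demonstration of the name collision.
demo=$(mktemp -d)
cd "$demo"

# wget first saves http://example.com/news as a plain file named "news":
echo '<html>news listing</html>' > news

# ...and later needs a directory "news/" to hold /news/article1,
# but a regular file already occupies that name:
mkdir news 2>/dev/null || echo "cannot create directory 'news': file exists"
```

A file and a directory cannot share a name, so one of the two downloads always clobbers the other.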

I have tried to work around this by running wget twice and moving the files between runs, but that hasn't worked well. The /news page has links to /news/article1 that need to be converted. I'm using the -k option to convert links, but when wget runs twice, it doesn't convert links between files downloaded in separate runs.

Here's my command:

wget -p -r -l4 -k -d -nH http://example.com

Here's an example of the work around that I've tried:

# wget once at first level (gets /news path but not /news/*)
wget -p -r -l1 -k -nH http://example.com

# move /news file to temp path
mv news /tmp/news.html

# wget again to get everything else (notice the different level value)
wget -p -r -l4 -k -nH http://example.com

# move temp path back to /news/index.html
mv /tmp/news.html news/index.html

In the above example, the links on the /news page that are supposed to point to /news/article1 have not been converted.

Does anybody know how to work around this with wget? Is there a different tool that would work better?

Upvotes: 5

Views: 3853

Answers (2)

Sam

Reputation: 262

Let's say you'd like to mirror an entire website with wget, and keep the naming scheme of the original website. That is, don't rename files by adding an .html extension.

A problem occurs when a file and a directory have the same name: the file is overwritten when the directory is created.

How about this solution:

  1. Mirror the website using wget --mirror
  2. Then, as a second step, go back and download only the problematic files. So, if there was a wiki/ directory whose main page was saved as a plain file named wiki, download just that page again and save it as wiki/index.html.

The script below generates index.html files when there is a file/directory name overlap.

#!/bin/bash

# For each directory in the mirror, fetch $website/<dir> into <dir>/index.html
# unless an index.html is already present.
function processdir() {
    website="https://www.example.com"
    echo "The dir name is $1"
    if [ -e "$1/index.html" ]; then
        echo "$1/index.html already exists"
    else
        echo "Downloading $website/$1 to $1/index.html"
        rm -f /tmp/index.html
        if wget --quiet -O /tmp/index.html "$website/$1"; then
            echo "Download succeeded. Copying file into place."
            cp /tmp/index.html "$1/index.html"
        else
            echo "Download failed."
        fi
        ls -al "$1/index.html"
    fi
}
export -f processdir

# Run processdir on every directory under the current one.
find . -type d -exec /bin/bash -c 'processdir "$0"' {} \;

Upvotes: 0

dayer4b

Reputation: 987

I figured it out!

The problem was my assumption that /news/index.html was the filename I needed. After closely reading the man page, I found that -E (--adjust-extension) solved my problem. This flag makes wget append an .html extension to every HTML file it downloads.

Coupling that with -k to convert the links results in a 100% usable mirror that has all of the pages needed.

Here's an example map of the downloaded files and paths:

http://example.com/news           -->  /news.html
http://example.com/news/article1  -->  /news/article1.html

As a functional mirror, this is great. Default webserver configurations (at least for Apache) seem to allow the path http://sitemirror.com/news/article1 to load the /news/article1.html content. However, a rewrite rule may be needed to keep the http://sitemirror.com/news path from returning a 404 or a directory index. This should not be tough.
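For reference, a mod_rewrite sketch that maps extensionless requests onto the .html files produced by -E might look like the following (untested against a real mirror; the paths are placeholders):

# Hypothetical Apache configuration for serving the -E mirror.
#   /news          -> /news.html
#   /news/article1 -> /news/article1.html
RewriteEngine On
# If the requested path is not a regular file but a .html
# sibling exists on disk, serve that file instead.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ $1.html [L]

Note the first condition tests only !-f, not !-d, so a request for the /news directory itself also gets rewritten to /news.html rather than showing a folder index.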

Oh, so here's my final wget command:

wget -p -r -l4 -E -k -nH http://example.com

Upvotes: 4
