Matthieu Moy

Reputation: 16547

Checking for dead links locally in a static website (using wget?)

A very nice tool to check for dead links (e.g. links pointing to 404 errors) is wget --spider. However, I have a slightly different use case where I generate a static website, and want to check for broken links before uploading. More precisely, I want to check both:

I tried wget --spider --force-html -i file-to-check.html, which reads the local file, treats it as HTML and follows each link. Unfortunately, it can't deal with relative links within the local HTML file (it errors out with Cannot resolve incomplete link some/file.pdf). I also tried using file:// URLs, but wget does not support them.

Currently, I have a hack based on running a local web server with python3 -m http.server and checking the local files through HTTP:

python3 -m http.server &
pid=$! 
sleep .5
error=0
wget --spider -nd -nv -H -r -l 1 http://localhost:8000/index.html || error=$? 
kill $pid
wait $pid
exit $error

I'm not really happy with this for several reasons:

Ideally, python3 -m http.server would have an option to run a command when the server is ready and would shutdown itself after the command is completed. That sounds doable by writing a bit of Python, but I was wondering whether a cleaner solution exists.

Did I miss anything? Is there a better solution? I'm mentioning wget in my question because it does almost what I want, but using wget is not a requirement for me (nor is python -m http.server). I just need to have something easy to run and automate on Linux.

Upvotes: 17

Views: 1571

Answers (2)

Matthieu Moy

Reputation: 16547

Tarun Lalwani's answer is correct, and by following the advice given there one can write a clean and short shell script (relying on Python and awk). Another solution is to write the script entirely in Python, which gives a slightly more verbose but arguably cleaner script. The server is launched in a thread, then the command that checks the website is executed, and finally the server is shut down. We no longer need to parse textual output or send a signal to an external process. The key parts of the script are therefore:

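import subprocess
import sys
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

def start_server(port,
                 server_class=HTTPServer,
                 handler_class=SimpleHTTPRequestHandler):
    # An empty host string binds to all interfaces on the given port
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    # Serve requests in a background thread so the check can run meanwhile
    thread = threading.Thread(target=httpd.serve_forever)
    thread.start()
    return httpd

def main(cmd, port):
    httpd = start_server(port)
    status = subprocess.call(cmd)
    # Stop serve_forever(), which also lets the server thread finish
    httpd.shutdown()
    sys.exit(status)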

I wrote a slightly more advanced script (with a bit of command-line option parsing on top of this) and published it as: https://gitlab.com/moy/check-links
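
As a rough illustration of the same idea, here is a minimal sketch that also avoids hard-coding the port. The names run_with_server and make_cmd are hypothetical (they are not taken from the check-links script), and the directory argument of SimpleHTTPRequestHandler assumes Python 3.7 or later. Passing port 0 asks the OS for a free port, and because the HTTPServer constructor already binds and listens on the socket, the check can start right away without a sleep.

```python
import subprocess
import sys
import threading
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


def run_with_server(make_cmd, directory="."):
    """Serve `directory` on a free port, run make_cmd(url), return its exit status.

    make_cmd is a hypothetical callback: it receives the base URL of the
    temporary server and returns the command (a list) to run against it.
    """
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Port 0 lets the OS pick any free port. The socket is bound and
    # listening as soon as the constructor returns.
    httpd = HTTPServer(("127.0.0.1", 0), handler)
    port = httpd.server_address[1]
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    try:
        return subprocess.call(make_cmd("http://127.0.0.1:%d/" % port))
    finally:
        httpd.shutdown()      # stop serve_forever()
        httpd.server_close()  # release the listening socket
        thread.join()


# Possible usage with the wget invocation from the question:
# status = run_with_server(lambda url: [
#     "wget", "--spider", "-nd", "-nv", "-H", "-r", "-l", "1", url + "index.html"])
# sys.exit(status)
```

The try/finally block is there so that the server is shut down even if launching the command raises an exception (for example when wget is not installed).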

Upvotes: 0

Tarun Lalwani

Reputation: 146610

I think you are heading in the right direction. I would stick with wget and Python, as both are readily available on many systems, and together they get the job done. What you want now is to wait for the text Serving HTTP on 0.0.0.0 to appear on the stdout of the server process.

I would start the process with something like this:

python3 -u -m http.server > ./myserver.log &

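Note the -u I have used here for unbuffered output. This is really important: without it, the message may stay in Python's output buffer and never reach the log file in time.

The next step is to wait for this text to appear in myserver.log: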

timeout 10 awk '/Serving HTTP on 0.0.0.0/{print; exit}' <(tail -f ./myserver.log)

Here 10 seconds is your maximum wait time, and the rest is self-explanatory. Next, about your kill $pid: I don't think it is a problem, but if you want it to be closer to what a user would do, I would change it to

kill -s SIGINT $pid

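This is equivalent to pressing CTRL+C after launching the program. One caveat: in a non-interactive script, bash starts background commands with SIGINT ignored, so the server may not react to this signal; if that happens, the plain kill $pid (SIGTERM) still works. I would also handle SIGINT in the bash script itself, using something like the answer below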

https://unix.stackexchange.com/questions/313644/execute-command-or-function-when-sigint-or-sigterm-is-send-to-the-parent-script/313648

In short, the answer linked above adds the following to the top of the bash script, so that it cleans up when it is killed with CTRL+C or by an external kill signal:

#!/bin/bash
exit_script() {
    echo "Printing something special!"
    echo "Maybe executing other commands!"
    trap - SIGINT SIGTERM # clear the trap
    kill -- -$$ # Sends SIGTERM to the whole process group (this script and its child processes)
}

trap exit_script SIGINT SIGTERM
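
For comparison, the steps above (start the server, wait until it is ready, run the check, stop the server with SIGINT) can be sketched in Python as well. This is only a sketch under some assumptions: the function names wait_for_port and check_with_server are made up for illustration, readiness is detected by polling the TCP port rather than by watching the log for the Serving HTTP message, and send_signal with SIGINT assumes a POSIX system such as Linux.

```python
import signal
import socket
import subprocess
import sys
import time


def wait_for_port(host, port, timeout=10.0):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # not listening yet, retry shortly
    return False


def check_with_server(cmd, port=8000, timeout=10.0):
    """Run cmd while `python -m http.server` serves the current directory."""
    server = subprocess.Popen(
        [sys.executable, "-m", "http.server", "--bind", "127.0.0.1", str(port)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        if not wait_for_port("127.0.0.1", port, timeout):
            raise RuntimeError("server did not start within %s seconds" % timeout)
        return subprocess.call(cmd)
    finally:
        # The try/finally plays the role of the trap: clean up on any exit path.
        server.send_signal(signal.SIGINT)  # like pressing CTRL+C
        try:
            server.wait(timeout=5)
        except subprocess.TimeoutExpired:
            server.kill()  # fall back if SIGINT was ignored
            server.wait()
```

Polling the port has the side benefit of not depending on the exact wording of the startup message, which may differ between Python versions and bind addresses.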

Upvotes: 11
