Manu

Reputation: 4500

Check if a URL goes to a page containing the text "404"

I have a bash script to check the HTTP status code of a list of URLs, but I've realized that some of them report "200" while the page actually displayed contains "error 404". How could I check for that?

Here's my current script:

#!/bin/bash
while read -r LINE; do
  curl -o /dev/null --silent --head --write-out '%{http_code}\n' "$LINE"
done < url-list.txt

(I got it from a previous question: script to get the HTTP status code of a list of urls?)

EDIT: There seems to be a bug in the script: it returns "200", but if I wget -o log that same address I get "404 Not Found".
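
For what it's worth, fetching the page body directly does show the error text, something like this (the URL is a placeholder, and "error 404" is simply the text those pages display):

wget -qO- "http://example.com/broken/page" | grep -i "error 404" && echo "page contains a 404 message"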

Upvotes: 3

Views: 6026

Answers (2)

clt60

Reputation: 63952

For fun, here is a pure Bash solution:

# act on the status code extracted from the response
dosomething() {
        code="$1"; url="$2"
        case "$code" in
                200) echo "OK for $url";;
                302) echo "redir for $url";;
                404) echo "notfound for $url";;
                *) echo "other $code for $url";;
        esac
}

# MAIN program
while read -r url
do
        # split "http://host/path" into host and path
        uri=($(echo "$url" | sed 's~http://\([^/][^/]*\)\(.*\)~\1 \2~'))
        HOST=${uri[0]:=localhost}
        FILE=${uri[1]:=/}
        # open a TCP connection to port 80 via bash's /dev/tcp
        exec {SOCKET}<>/dev/tcp/$HOST/80
        # minimal request; "Connection: close" makes the server end the
        # response, so the read below does not hang on keep-alive
        printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$FILE" "$HOST" >&${SOCKET}
        # keep only the header block and pick out the status line
        res=($(<&${SOCKET} sed '/^.$/,$d' | grep '^HTTP'))
        exec {SOCKET}>&-
        dosomething "${res[1]}" "$url"
done << EOF
http://stackoverflow.com
http://stackoverflow.com/some/bad/url
EOF
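
Note that the automatic file-descriptor assignment (exec {SOCKET}<>...) needs bash 4.1 or newer, and /dev/tcp only speaks plain HTTP on port 80, with no TLS and no redirects followed. To run it against the url-list.txt from the question instead of the inline list, the heredoc can simply be replaced by a redirect:

done < url-list.txt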

Upvotes: 3

sapht

Reputation: 2829

Well, you could grep the response body and look for "404", "Error 404", "Not Found", "404 Not Found", etc. printed in plaintext, but that is likely to give both false negatives and false positives. Though if the server sends 200 for what's supposed to be a 404, somebody didn't do their job right.
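
A rough sketch of that approach, building on the curl loop from the question (the phrases in the grep pattern are guesses, tune them to whatever your error pages actually print):

#!/bin/bash
# fetch the body with GET (not HEAD) and append the status code on its own line
while read -r url; do
    out=$(curl --silent --location --write-out '\n%{http_code}' "$url")
    code=${out##*$'\n'}     # last line: the status code added by --write-out
    body=${out%$'\n'*}      # everything before it: the page itself
    if [ "$code" = "200" ] && printf '%s' "$body" | grep -qiE 'error 404|404 not found'; then
        echo "soft 404 for $url"
    else
        echo "$code for $url"
    fi
done < url-list.txt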

Upvotes: 1
