shaiss

Reputation: 2010

Watch a web page for changes

I googled and couldn't find any code that would compare a web page to a previous version of itself.

In this case the page I'm trying to watch is link text. There are services that can watch a page, but I'd like to set this up on my own server.

I've set this up as a wiki so anyone can add to the code. Here's my idea:

  1. Check whether a previous version of the file exists. If not, download the page.
  2. If a previous version exists, download the page again, diff the two to find the differences, and email the new content along with the dates of the new and old versions.

This script would be called nightly via cron, or on demand via the browser (the latter is not a priority).
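
For the nightly run, a crontab entry along these lines would be one option. This is only a sketch: the directory `$HOME/pagewatch` and the script name `watchpage.sh` are made-up placeholders, and 03:00 is an arbitrary time. The `cd` matters if the script stores its copy of the page under a relative path.

```
# m  h  dom mon dow  command
0    3  *   *   *    cd "$HOME/pagewatch" && ./watchpage.sh >/dev/null 2>&1
```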

It sounds simple; maybe I'm just not looking in the right place.

Upvotes: 1

Views: 1505

Answers (2)

osti

Reputation: 447

Perhaps a simple sh-script like this, featuring wget, diff & test?

#!/bin/sh

WWWURI="http://foo.bar/testfile.html"
LOCALCOPY="testfile.html"
TMPFILE="tmpfile"
WEBFILE="changed.html"

MAILADDRESS="$(whoami)"
SUBJECT_NEWFILE="$LOCALCOPY is new"
BODY_NEWFILE="first version of $LOCALCOPY loaded"
SUBJECT_CHANGEDFILE="$LOCALCOPY updated"
SUBJECT_NOTCHANGED="$LOCALCOPY not updated"
BODY_CHANGEDFILE="new version of $LOCALCOPY"

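# an earlier copy exists: keep it as .bak and fetch the current page
if [ -e "$LOCALCOPY" ]
then
    mv "$LOCALCOPY" "$LOCALCOPY.bak"
    if ! wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    then
        # download failed: restore the old copy so the next run still has a baseline
        mv "$LOCALCOPY.bak" "$LOCALCOPY"
        exit 1
    fi
    # old file first, so lines added to the page show up as ">" in the diff
    diff "$LOCALCOPY.bak" "$LOCALCOPY" > "$TMPFILE"

    # a non-empty diff means the page changed
    if [ -s "$TMPFILE" ]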
    then
        echo "$SUBJECT_CHANGEDFILE"
        ( echo "$BODY_CHANGEDFILE" ; cat "$TMPFILE" ) | tee "$WEBFILE" | mail -s "$SUBJECT_CHANGEDFILE" "$MAILADDRESS"
    else
        echo "$SUBJECT_NOTCHANGED"
    fi
else
    wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    echo "$BODY_NEWFILE"
    echo "$BODY_NEWFILE" | tee "$WEBFILE" | mail -s "$SUBJECT_NEWFILE" "$MAILADDRESS"
fi
[ -e "$TMPFILE" ] && rm "$TMPFILE"

Update: the output is now piped through tee, some spelling is fixed, and $TMPFILE is removed at the end.
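
The question also asks for the dates of the old and new versions in the mail, which the script does not report. A minimal sketch of one way to add that, assuming GNU date (where `-r FILE` prints the file's last modification time); `version_dates` is a hypothetical helper name, not something the script already defines:

```shell
#!/bin/sh
# Hypothetical helper, not part of the script above.
# Assumes GNU date, where "-r FILE" prints FILE's last modification time.
version_dates() {
    old_file=$1
    new_file=$2
    echo "old version: $(date -r "$old_file" '+%Y-%m-%d %H:%M:%S')"
    echo "new version: $(date -r "$new_file" '+%Y-%m-%d %H:%M:%S')"
}
```

In the "changed" branch it could be called as `version_dates "$LOCALCOPY.bak" "$LOCALCOPY"` inside the parentheses that build the mail body. The modification time is whatever the download left on the file, which may be the fetch time or a server-supplied timestamp, so it is worth checking what your wget setup does.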

Upvotes: 3

mjv

Reputation: 75205

You can check this SO posting for a few ideas, and for information about the challenge of detecting "true" changes to a web page (as opposed to fluctuating advertisement blocks and other "noise").
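
As a rough illustration of filtering such noise before diffing, the page can be passed through a small filter first. This is only a sketch under simple assumptions: `normalize_page` is a hypothetical name, the sed patterns handle HTML comments on a single line and script blocks that either sit on one line or span whole lines, and a real page will likely need site-specific patterns (or a proper HTML parser).

```shell
#!/bin/sh
# Hypothetical filter: reads HTML on stdin, writes a "quieter" version to stdout.
normalize_page() {
    # 1. drop HTML comments that open and close on the same line
    # 2. drop <script>...</script> blocks contained in one line
    # 3. drop multi-line script blocks (from a "<script" line to a "</script>" line)
    # 4. trim trailing whitespace, then discard lines that are now empty
    sed -e 's/<!--.*-->//g' \
        -e 's/<script[^>]*>.*<\/script>//g' \
        -e '/<script/,/<\/script>/d' \
        -e 's/[[:space:]]*$//' \
    | grep -v '^[[:space:]]*$'
}
```

With wget this could be used as `wget -q -O - "$WWWURI" | normalize_page > "$LOCALCOPY"`, so that both the stored copy and the fresh copy are filtered the same way before diff sees them.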

Upvotes: 0
