EGr

Reputation: 2172

Is there an easy way to script a comparison of a web page over time?

I have a website that I want to monitor for changes, specifically in one DIV in the HTML. I was using http://www.followthatpage.com/ to monitor a webpage for changes, but I ran into two issues:

  1. It checks the whole site, not just one DIV
  2. It only checks the site once per hour

Ideally, I would like to write a bash or python script that does a diff of two files every 15 minutes, and emails any changes. I was thinking I might be able to use the diff command after downloading two files, and set it up for a cron to email if there are changes, but I still don't know how to filter only to a specific DIV.

Is there an easier way than figuring out how to do this myself (for example, an existing script)? If not, what would be the best method to do this?

Upvotes: 3

Views: 1269

Answers (2)

Kyle Kelley

Reputation: 14144

Since the div you want is specific to the site, you will probably have to set up a simple check yourself.

This consists of

  • Downloading the HTML - urllib.request.urlopen(URL) (urllib.urlopen(URL) on Python 2) or requests.get(URL).
  • Extracting just the right section (BeautifulSoup, or roll your own).
  • Performing your comparison (straight comparison or difflib).
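
The comparison step can be sketched with difflib from the standard library. This is a minimal sketch, assuming the extracted section has already been reduced to plain text; the function name diff_snapshots and the "previous"/"current" labels are illustrative, not part of any library.

```python
import difflib


def diff_snapshots(old_text, new_text):
    """Return a unified diff of two text snapshots, or '' if they match."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",  # illustrative label for the older snapshot
        tofile="current",     # illustrative label for the newer snapshot
        lineterm="",          # input lines carry no newline, so add none
    )
    return "\n".join(diff)
```

An empty return value means nothing changed, so a script can skip the notification in that case.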

Figuring out what to extract, and how, is going to take the longest. I recommend inspecting the page with Developer Tools in Chrome/Firefox.

Let's say we want to know when the counter updates on digitalocean.com. The div for the counter looks like this:

<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>

Sadly, there's no id, which would be really easy to pull out using BeautifulSoup4 (e.g. soup.find(id="counter")).

Instead, I would elect to pull out all the inner elements that have class "count".

import requests
from bs4 import BeautifulSoup

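resp = requests.get('https://www.digitalocean.com')
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(resp.text, 'html.parser')
# Collect each digit span; the delimiter span has a different class and is skipped
digits = [tag.get_text() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))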

BeautifulSoup has excellent documentation for parsing out HTML documents without having to bang your head (depending on how well laid out the site you're scraping is).
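
To notice when the extracted value changes between runs, the script needs to remember the previous value somewhere. One possible approach, sketched here under the assumption that a small text file is an acceptable place to keep state, is to save the value after each check. The name has_changed and the state file path are hypothetical.

```python
from pathlib import Path


def has_changed(new_value, state_file):
    """Compare new_value with the value saved by the previous run.

    Returns True only if a previous value exists and differs.
    The first run just records the value and returns False.
    """
    path = Path(state_file)
    previous = path.read_text() if path.exists() else None
    path.write_text(new_value)  # remember the value for the next run
    return previous is not None and previous != new_value
```

With the counter example above, this might be called as has_changed(str(count), "/tmp/counter_state.txt") each time the script runs.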

Upvotes: 3

user2599522

Reputation: 3215

Another way to do it, if you have access to a Linux terminal, is to add a cron job:

$ crontab -e

and add the following line (it runs every day at 16:00; to run every 15 minutes, use */15 * * * * as the schedule instead)

0   16   *   *   *   diff_web_page.sh

where the contents of diff_web_page.sh are

#!/bin/bash

URL="http://linux.die.net/man/1/bash";
TMP_FILE="/tmp/diff_page.txt";
if [[ ! -f "$TMP_FILE" ]]; then
    # First run: create the baseline file and exit.
    lynx -dump "$URL" &> "$TMP_FILE"
    # To keep only a div, filter the raw HTML (lynx -source) instead:
    # lynx -source "$URL" | pcregrep -M "<div>.*</div>" > "$TMP_FILE"
else
    # The file exists: grab the new version and compare it.
    lynx -dump "$URL" &> "$TMP_FILE.new"  # use pcregrep here too if required
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi

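Cron mails a job's output to the owner of the crontab, so, provided local mail delivery is set up, this will email the diff of the web page to user@host (your account on the Linux box running the cron job) whenever the script prints one.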

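If you want a specific div, you can filter the page with pcregrep -M before saving it. Note that lynx -dump prints the rendered text without tags, so fetch the raw HTML with lynx -source when matching on <div> markup.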
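
If cron's local mail is not available, one alternative is to send the diff from Python instead. The following is a rough sketch, not a tested mail setup: it assumes an SMTP server is reachable at the given host and port, and the function names, addresses, and subject format are all illustrative.

```python
import smtplib
from email.message import EmailMessage


def build_change_email(diff_text, sender, recipient, url):
    """Build a plain-text message whose body is the diff."""
    msg = EmailMessage()
    msg["Subject"] = f"Page changed: {url}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(diff_text)
    return msg


def send_change_email(msg, host="localhost", port=25):
    """Send the message; host and port are assumed defaults."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Keeping message construction separate from sending makes it possible to inspect the message without a mail server.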

Upvotes: 4
