Is there an easy way to script a comparison of a web page over time?

Question

I have a website that I want to monitor for changes, specifically in one DIV in the HTML. I was using http://www.followthatpage.com/ to monitor a webpage for changes, but I ran into two issues:

It checks the whole site, not just one DIV
It only checks the site once per hour

Ideally, I would like to write a bash or Python script that diffs two downloaded copies of the page every 15 minutes and emails me any changes. I was thinking I could download the page, run the diff command against the previous copy, and schedule that with cron to email me when something differs, but I still don't know how to restrict the comparison to a specific DIV.

Is there an easier way than figuring out how to do this myself (an existing script)? If not, what would be the best method to do this?

Kyle Kelley · Accepted Answer

Since the div you want is specific to the site, you will probably have to set up a simple check yourself.

This consists of:

Downloading the HTML: urllib.request.urlopen(URL) (urllib.urlopen(URL) on Python 2) or requests.get(URL).
Extracting just the right section (BeautifulSoup, or roll your own).
Performing your comparison (a straight equality check, or difflib).
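The comparison step can be sketched with the standard library's difflib. This is only a minimal illustration: the function name diff_snapshots and the two snapshot strings are made up for the example, and in practice the strings would be the text extracted from the page on two successive runs.

```python
import difflib


def diff_snapshots(old_text, new_text):
    """Return a unified diff of two text snapshots, or '' if they match."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff_lines)


# Hypothetical snapshots of the extracted section, taken on two runs.
old_snapshot = "Servers launched\n1,234,567"
new_snapshot = "Servers launched\n1,234,570"

report = diff_snapshots(old_snapshot, new_snapshot)
if report:
    # A non-empty report means the section changed between the two runs.
    print(report)
```

An empty return value means nothing changed, so the caller can use the report both as the change test and as the body of the notification.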

Figuring out what and how to extract the data is going to take you the longest time. I recommend using Developer Tools in Chrome/Firefox.

Let's say we want to know when the counter updates on digitalocean.com. The div for the counter looks like this:

Sadly, there's no id on it; an id would be really easy to pull out using BeautifulSoup4 (e.g. soup.find(id="counter")).
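For a page where the target div does carry an id, the lookup might look like the sketch below. The markup here is invented purely for illustration (it is not the real page's HTML), and the id "counter" is the same hypothetical one used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from any real site.
html = """
<html><body>
  <div id="counter"><span class="count">4</span><span class="count">2</span></div>
  <div id="footer">unrelated content</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when no element has the requested id.
counter_div = soup.find(id="counter")
if counter_div is not None:
    # Only the text inside the chosen div is kept; the footer is ignored.
    section_text = counter_div.get_text(strip=True)
    print(section_text)
```

Checking for None matters in a long-running monitor, since a site redesign can silently remove the element you were watching.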

Instead, I would elect to pull out all the inner elements that have class "count".

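import requests
from bs4 import BeautifulSoup

# Download the page and parse it; naming the parser avoids a bs4 warning.
resp = requests.get('https://www.digitalocean.com')
soup = BeautifulSoup(resp.text, 'html.parser')

# Join the text of every element with class "count" into one number.
digits = [tag.get_text() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))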

BeautifulSoup has excellent documentation, and it lets you parse HTML documents without banging your head against them (how painless that is depends on how well laid out the site you're scraping is).
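To cover the "every 15 minutes, email me" part of the question, one possible approach is to store the last extracted value in a file and compare against it on each run. The sketch below uses only the standard library; the function names, the state file, the email addresses, and the assumption of an SMTP server on localhost are all hypothetical and would need adjusting.

```python
import os
import smtplib
import tempfile
from email.message import EmailMessage


def check_for_change(state_path, current_value):
    """Save current_value and report whether it differs from the last run.

    Returns (changed, previous). On the first run there is nothing to
    compare against, so the result is (False, None).
    """
    previous = None
    if os.path.exists(state_path):
        with open(state_path, encoding="utf-8") as fh:
            previous = fh.read()
    with open(state_path, "w", encoding="utf-8") as fh:
        fh.write(current_value)
    changed = previous is not None and previous != current_value
    return changed, previous


def build_alert(previous, current, sender, recipient):
    """Build (but do not send) a plain-text notification email."""
    msg = EmailMessage()
    msg["Subject"] = "Watched section changed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Previous: {previous}\nCurrent: {current}\n")
    return msg


def send_alert(msg, host="localhost"):
    # Assumes an SMTP server is listening on `host`; adjust for your setup.
    with smtplib.SMTP(host) as server:
        server.send_message(msg)


# Simulate two scheduled runs with hypothetical extracted values.
state_file = os.path.join(tempfile.mkdtemp(), "last_value.txt")
first = check_for_change(state_file, "1234567")   # first run: only records
second = check_for_change(state_file, "1234570")  # differs from saved value
if second[0]:
    alert = build_alert(second[1], "1234570",
                        "monitor@example.com", "me@example.com")
    print(alert["Subject"])
    # send_alert(alert)  # enable once an SMTP server is available
```

In a real script the value passed to check_for_change would be the text extracted from the page, and a cron entry along the lines of */15 * * * * python3 /path/to/check.py (path hypothetical) would run it every 15 minutes.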
