Reputation: 2172
I have a website that I want to monitor for changes, specifically in one DIV in the HTML. I was using http://www.followthatpage.com/ to monitor a webpage for changes, but I ran into two issues:
Ideally, I would like to write a bash or Python script that diffs two files every 15 minutes and emails any changes. I was thinking I might be able to use the diff command after downloading two files and set it up as a cron job that emails me if there are changes, but I still don't know how to filter down to a specific DIV.
Is there an easier way than figuring out how to do this myself (an existing script)? If not, what would be the best method to do this?
Upvotes: 3
Views: 1269
Reputation: 14144
Since the div you want is specific to the site, you will probably have to set up a simple check.
This consists of fetching the page with urllib.request.urlopen(URL) (urllib.urlopen(URL) on Python 2) or requests.get(URL), and then extracting the data you care about.
Figuring out what to extract and how to extract it is going to take you the longest. I recommend using Developer Tools in Chrome/Firefox to inspect the page.
Let's say we want to know when the counter updates on digitalocean.com. The div for the counter looks like this:
<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>
Sadly, there's no id, which would be really easy to pull out using BeautifulSoup4 (e.g. soup.find(id="counter")).
Instead, I would elect to pull out all the inner elements that have class "count".
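import requests
from bs4 import BeautifulSoup
# Fetch the page and parse it; naming the parser avoids bs4's "no parser specified" warning.
resp = requests.get('https://www.digitalocean.com')
soup = BeautifulSoup(resp.text, 'html.parser')
# Each digit sits in its own <span class="count">; join them into one number.
digits = [tag.get_text() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))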
BeautifulSoup has excellent documentation for parsing out HTML documents without having to bang your head (depending on how well laid out the site you're scraping is).
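To turn the extracted value into an actual change check, you need to remember what you saw on the previous run. Here is a minimal sketch of that step using only the standard library. The function name has_changed and the state file path are my own placeholders, not part of any library, and I'm assuming you pass in whatever text you pulled out of the div.

```python
import hashlib
from pathlib import Path


def has_changed(content, state_file):
    """Return True when `content` differs from what was seen on the last run.

    `state_file` is a hypothetical path used to remember the previous hash.
    """
    state = Path(state_file)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = state.read_text().strip() if state.exists() else None
    # Always store the latest hash so the next run compares against it.
    state.write_text(digest)
    # The first run has nothing to compare against, so it is not a change.
    return previous is not None and previous != digest
```

You would call it with something like has_changed(''.join(digits), '/tmp/counter.sha256') and print or email the new value only when it returns True.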
Upvotes: 3
Reputation: 3215
Another way to do it, if you have access to a Linux terminal, is to add a cron job:
$ crontab -e
and add the following line (runs every day at 16:00):
0 16 * * * diff_web_page.sh
where the contents of diff_web_page.sh are:
#!/bin/bash
URL="http://linux.die.net/man/1/bash"
TMP_FILE="/tmp/diff_page.txt"
if [[ ! -f "$TMP_FILE" ]]; then
    # First run: create the baseline file and exit.
    lynx -dump "$URL" &> "$TMP_FILE"
    # To keep only one div, fetch the raw HTML instead (-dump strips the tags):
    # lynx -source "$URL" | pcregrep -M "<div>.*</div>" > "$TMP_FILE"
else
    # The file exists: grab the new version and compare it with the old one.
    lynx -dump "$URL" &> "$TMP_FILE.new"  ## use pcregrep here too if required.
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi
Cron mails whatever the script prints to user@host (on the Linux box where the cron job runs), so you will receive the diff of the web page each time a run finds a change.
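If you would rather keep everything in Python, the diff step can be sketched with the standard difflib module. This is only an illustration of the comparison itself; page_diff is a name I made up, and fetching and storing the two snapshots is left to you.

```python
import difflib


def page_diff(old_text, new_text):
    """Return a unified diff of two page snapshots, or '' when they match."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(lines)
```

Printing the result only when it is non-empty gives the same behaviour under cron: no output, no mail.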
If you want a specific div, you can filter the output with pcregrep -M when dumping the web page with lynx.
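Multi-line regexes over HTML tend to be brittle, so as an alternative to pcregrep (this is a swapped-in technique, not what the script above does) here is a rough standard-library sketch that pulls the text out of the first div carrying a given class. DivTextExtractor and div_text are hypothetical names, and it assumes you feed it raw HTML rather than the rendered text that lynx -dump produces.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth of <div> tags inside the target
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            # A nested div inside the target: track it so we know when we leave.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, css_class):
    """Return the text of the first div with `css_class`, or '' if none."""
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

The returned text can then be written to the temp file and diffed exactly as before.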
Upvotes: 4