Reputation: 11
I’m just dipping my toes into Python right now and I learn best (albeit inefficiently) with a project. My current project is a Twitter bot that scrapes a government website for the latest COVID-19 case counts in my jurisdiction and tweets them out, building off this awesome tutorial.
Functionally it is working, but I want to finesse it so that it only posts when that data is updated and new. Otherwise, it’s just an account that posts the same information every day rather than a news account.
I thought the built-in rules in the Twitter API that don’t allow duplicate tweets would automatically filter out old information. Sometimes that works, but the rule isn’t strict enough: it appears the account can still post duplicates as long as it doesn’t do so too often. Ideally, I’d like to make the check stricter in my own code. It would need to compare the new text to the last tweet, and only tweet if the text is different.
Can anyone give me some guidance on whether this is possible, and how best to get it done? I’m at a stage in my coding where I’m not sure what terms to search for to find a solution.
Here’s the current code as it stands:
import sys
from config import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET
import tweepy
import requests
from lxml import html
from threading import Timer


def create_tweet():
    # Fetch the page and parse the HTML
    response = requests.get('https://yukon.ca/en/case-counts-covid-19')
    doc = html.fromstring(response.content)
    # Second column of the case-count table; expects exactly six values
    A, B, C, D, E, F = doc.xpath('//table[@class="table"]//td[2]//text()')
    tweet = f'''Yukon COVID-19 cases count
Total people tested: {A}
Confirmed cases: {B}
Recovered cases: {C}
Deaths: {D}
Negative results: {E}
Pending results: {F}
Data from: https://yukon.ca/en/case-counts-covid-19
'''
    return tweet


if __name__ == '__main__':
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    # Create API object
    api = tweepy.API(auth)
    try:
        api.verify_credentials()
        print('Authentication Successful')
    except Exception:
        # Catch Exception rather than using a bare except,
        # so Ctrl-C and SystemExit are not swallowed
        print('Error while authenticating API')
        sys.exit(1)
    tweet = create_tweet()
    api.update_status(tweet)
    print('Tweet successful')
Upvotes: 1
Views: 500
Reputation: 35
Cool project :)
Without more context on the website you're scraping, I can't say whether your data source assigns some kind of unique identifier to its posts. This could be something synthetic like an ID number, or even a timestamp.
If your source posts do have IDs, then you can store that ID each time you tweet in some kind of database or file.
Then, when your scraper runs again, it can check against its list of IDs to prevent duplication.
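As a rough sketch of that idea, assuming the page exposes something usable as an ID (for example a "last updated" timestamp string), the seen IDs could live in a small JSON file. The file name `seen_ids.json` and the helper names below are made up for illustration; they are not part of tweepy or of the asker's code.

```python
import json
from pathlib import Path


def load_seen_ids(path):
    """Return the set of IDs already tweeted (empty set on the first run)."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text()))


def is_new(seen_ids, post_id):
    """True if this ID has not been tweeted before."""
    return post_id not in seen_ids


def mark_seen(path, seen_ids, post_id):
    """Record the ID and write the whole set back to disk."""
    seen_ids.add(post_id)
    Path(path).write_text(json.dumps(sorted(seen_ids)))
```

In the main block you would call `load_seen_ids('seen_ids.json')`, tweet only when `is_new(...)` returns True, and call `mark_seen(...)` after a successful tweet.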
If your data source doesn't publish a timestamp or other ID with its posts, I suggest writing a function that takes the text of your potential tweet, passes it through a hash function, and checks the resulting hash against your file/database of past hashes.
Here's a super-simple tutorial on using the MD5 hash function from hashlib to generate an MD5 digest (hash) string that you should be able to easily compare and store:
https://www.geeksforgeeks.org/md5-hash-python/
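Here is a minimal sketch of the hashing approach, assuming the bot only needs to remember the most recent tweet. The file name `last_tweet.md5` and the functions `text_digest`, `should_tweet`, and `remember` are hypothetical names chosen for this example. MD5 is fine here because it is only used to detect changes, not for anything security-related.

```python
import hashlib
from pathlib import Path

# Hypothetical location for the stored digest of the last tweet
HASH_FILE = Path("last_tweet.md5")


def text_digest(text):
    """Return the MD5 hex digest of the tweet text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def should_tweet(text, hash_file=HASH_FILE):
    """True if the text differs from the last tweeted text (or nothing is stored yet)."""
    hash_file = Path(hash_file)
    previous = hash_file.read_text().strip() if hash_file.exists() else None
    return text_digest(text) != previous


def remember(text, hash_file=HASH_FILE):
    """Store the digest of the text that was just tweeted."""
    Path(hash_file).write_text(text_digest(text))
```

With these in place, the end of the script could check `should_tweet(tweet)` before calling `api.update_status(tweet)`, and call `remember(tweet)` only after the tweet succeeds. Since the tweet text is short, storing the text itself and comparing it directly would work just as well; the hash mainly keeps the stored value compact.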
Upvotes: 1