Reputation: 9
The following is not necessarily a question. I wrote a small piece of code to extract data from a web page, and I would like to know what you think of it and how to improve it.
I need to know the dates of an interview for a PhD. They don't send us emails; the dates will be posted on this page. I noticed that the two PhD positions I am interested in are currently inside HTML comments. They both start with the string URBAN.
I created a regex to find all the comments
regex = r"<!--(.*?)-->"
and used a for loop to check whether the string URBAN appears inside those comments. If the string is no longer inside a comment, that hopefully means they have posted the dates.
This is my code:
import requests, re, time, smtplib

url = "http://dottorato.polito.it/Esami_accesso.html"
DEBUG = False
foundInComment = True

# .   matches anything but \n
# *   0 or more occurrences of the pattern to its left
# ()  groups
# ?   for non-greedy
regex = r"<!--(.*?)-->"

while foundInComment:
    try:
        r = requests.get(url)
        html = r.text
        result = re.findall(regex, html, re.DOTALL)  # re.DOTALL makes . match also \n
        for match in result:
            if len(re.findall("URBAN", match)) > 1:  # One of the comments has to have at least two URBAN
                foundInComment = True
                print("\"URBAN AND REGIONAL DEVELOPMEN\" found more than once in a comment at "
                      + time.strftime("%H:%M:%S"))
                break
            foundInComment = False
        time.sleep(600)
    except KeyboardInterrupt:
        raise
    except Exception as e:
        print(e)
        print("Going to sleep for 1 min")
        time.sleep(60)

if not DEBUG:
    fromaddr = '[email protected]'
    toaddrs = ['[email protected]', fromaddr]
    msg = 'Subject: PHD polito\n\n Go to %s' % url
    # Credentials
    username = 'someone'
    password = 'password'
    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()
    server.login(username, password)
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

print("End of program")
So, what do you think?
Thanks in advance!
PS: this is the part of the HTML comment that contains the word URBAN:
<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN</a></li>
<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN - Cluster Tecnologie per le Smart Communities - Progetto Edifici a Zero Consumo Energetico in Distretti Urbani Intelligenti</a></li>
-->
I'm almost sure that they will move these lines out of the comment and into the visible part of the web page.
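To sanity-check that assumption without hitting the site, the detection logic can be tried on two hand-made HTML strings. This is only a sketch: the function name `urban_in_comment` and both sample strings are made up for illustration and are not taken from the real page.

```python
import re

# DOTALL lets "." match newlines, since a comment may span several lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def urban_in_comment(html, needle="URBAN", min_hits=2):
    """Return True if any single HTML comment contains `needle` at least `min_hits` times."""
    return any(body.count(needle) >= min_hits for body in COMMENT_RE.findall(html))


# Hypothetical "before" page: both entries are still commented out.
before = """<ul>
<!--
<li>URBAN AND REGIONAL DEVELOPMEN</li>
<li>URBAN AND REGIONAL DEVELOPMEN - Cluster</li>
-->
</ul>"""

# Hypothetical "after" page: the entries were moved out of the comment.
after = """<ul>
<li>URBAN AND REGIONAL DEVELOPMEN</li>
<li>URBAN AND REGIONAL DEVELOPMEN - Cluster</li>
<!-- old notes -->
</ul>"""

print(urban_in_comment(before))  # True
print(urban_in_comment(after))   # False
```

Keeping the check in a pure function like this would let the polling loop and the email step be tested separately from the network.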
Upvotes: 0
Views: 71
Reputation: 474171
An alternative (and, I think, more reliable) approach would be to use a specialized tool for the job: an HTML parser. Here is an example using BeautifulSoup that prints all comments containing the word URBAN:
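import requests
from bs4 import BeautifulSoup, Comment

url = "http://dottorato.polito.it/Esami_accesso.html"
response = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")
# Keep only comment nodes whose text contains URBAN
print(soup.find_all(string=lambda text: isinstance(text, Comment) and 'URBAN' in text))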
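If adding a dependency is not desirable, roughly the same idea can be sketched with the standard library's `html.parser` module (Python 3). The class name `CommentCollector`, the helper `comments_containing`, and the sample string are all made up for illustration; the sample is not the real page.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment the parser encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called once per <!-- ... --> with the text between the markers.
        self.comments.append(data)


def comments_containing(html, needle="URBAN"):
    """Return the comments in `html` whose text contains `needle`."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return [c for c in parser.comments if needle in c]


# Invented sample: one comment mentions URBAN, one does not.
sample = "<ul><!-- <li>URBAN AND REGIONAL DEVELOPMEN</li> --><li>OTHER</li><!-- note --></ul>"
print(comments_containing(sample))
```

An empty list from `comments_containing(response.text)` would then play the same role as the regex check: no comment mentions URBAN any more.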
Upvotes: 1