Python web scraping inside HTML comments

Question

This is not strictly a question. I wrote a small piece of code to extract data from a web page, and I would like to know what you think of it and how it could be improved.

I need to know the interview dates for a PhD. They don't send us emails; they will post the dates on the web page instead. I noticed that the two PhD positions I am interested in currently sit inside HTML comments. Both start with the string URBAN.

I wrote a regex to find all the comments:

regex = r"<!--(.*?)-->"

and used a for loop to check whether any of those comments contains the word URBAN. If the string is no longer inside a comment, that hopefully means they have posted the dates.
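For illustration, the check described above can be sketched on its own, without the network loop. This is only a minimal sketch: it assumes the comment pattern is `<!--(.*?)-->`, the helper name `urban_still_commented` is hypothetical, and the two HTML snippets are made up rather than taken from the real page.

```python
import re

# Assumed pattern: lazily capture everything between <!-- and -->.
# re.DOTALL lets "." match newlines, so multi-line comments are captured too.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def urban_still_commented(html, needle="URBAN", min_count=2):
    """Return True if a single HTML comment contains `needle` at least `min_count` times."""
    return any(body.count(needle) >= min_count for body in COMMENT_RE.findall(html))


# Made-up sample pages, for illustration only.
hidden = "<p>Dates</p><!-- URBAN A\nURBAN B -->"
posted = "<p>URBAN A</p><p>URBAN B</p><!-- old note -->"

print(urban_still_commented(hidden))
print(urban_still_commented(posted))
```

Keeping the check in a small function like this makes it possible to try it against saved copies of the page before running the polling loop.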

This is my code:

import requests, re, time, smtplib

url = "http://dottorato.polito.it/Esami_accesso.html"

DEBUG = False

foundInComment = True

# .  matches anything but \n
# *  0 or more occurrences of the pattern to its left
# () groups
# ?  for non-greedy
regex = r"<!--(.*?)-->"

while foundInComment:
    try:
        r = requests.get(url)
        html = r.text 

        result = re.findall(regex,html,re.DOTALL) # re.DOTALL makes . also match \n

        for match in result:
            if len(re.findall("URBAN",match)) > 1: # One of the comments has to contain URBAN at least twice
                foundInComment = True
                print('"URBAN AND REGIONAL DEVELOPMEN" found more than once in a comment at '
                      + time.strftime("%H:%M:%S"))
                break
            foundInComment = False

        time.sleep(600)

    except KeyboardInterrupt:
        raise
    except Exception as e:
        print(e)
        print("Going to sleep for 1 min")
        time.sleep(60)

if not DEBUG:
    fromaddr = 'someMail@gmail.com'
    toaddrs  = ['otherMail@gmail.com', fromaddr]

    msg = 'Subject: PHD polito\n\nGo to %s' % url

    # Credentials
    username = 'someone'
    password = 'password'

    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()
    server.login(username,password)
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

    print("End of program")

So, what do you think?

Thanks in advance!

PS: this is the part of the HTML comment that contains the word URBAN:

URBAN AND REGIONAL DEVELOPMEN
URBAN AND REGIONAL DEVELOPMEN - Cluster Tecnologie per le Smart Communities - Progetto Edifici a Zero Consumo Energetico in Distretti Urbani Intelligenti
-->

I'm almost sure that they will cut this text out of the comment and paste it into the visible part of the web page.
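If that is what happens, a complementary check would be to look for the string in the visible text rather than in the comments. The following is a rough sketch under stated assumptions: it targets Python 3 and the standard-library `html.parser` module instead of a regex, the names `CommentSplitter` and `posted_outside_comment` are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class CommentSplitter(HTMLParser):
    """Collect comment bodies and ordinary text in separate lists."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        # Called with the text between <!-- and -->.
        self.comments.append(data)

    def handle_data(self, data):
        # Called with text found outside tags and comments.
        self.text.append(data)


def posted_outside_comment(html, needle="URBAN AND REGIONAL DEVELOPMEN"):
    """Return True if `needle` appears in the page text outside any comment."""
    parser = CommentSplitter()
    parser.feed(html)
    parser.close()
    return needle in "".join(parser.text)


# Made-up sample pages, for illustration only.
still_hidden = "<p>Dates</p><!-- URBAN AND REGIONAL DEVELOPMEN -->"
now_posted = "<p>URBAN AND REGIONAL DEVELOPMEN - 12 May</p><!-- note -->"

print(posted_outside_comment(still_hidden))
print(posted_outside_comment(now_posted))
```

Checking for the text appearing outside comments, rather than disappearing from them, would avoid a false alarm if the comment were simply deleted.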
