Tharek M

Reputation: 9

Python web scraping inside HTML comments

The following is not strictly a question. I wrote a small piece of code to extract data from a web page, and I would like to know what you think of it and how to improve it.

I need to know the dates for a PhD interview. They don't send us emails; instead, they post the dates on this page. I noticed that the two PhD positions I am interested in are currently inside HTML comments. Both start with the string URBAN.

I created a regex to find all the comments

regex = r"<!--(.*?)-->"

and used a for loop to check whether any of those comments contains the word URBAN. If the string is no longer inside a comment, that hopefully means they have posted the dates.
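To make the check concrete, here is a minimal sketch of the idea on a made-up HTML snippet. `SAMPLE_HTML` and the helper name `still_commented_out` are invented for illustration and are not taken from the real page:

```python
import re

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """<ul>
<li>MATHEMATICS</li>
<!--
<li>URBAN AND REGIONAL DEVELOPMEN</li>
<li>URBAN AND REGIONAL DEVELOPMEN - Cluster</li>
-->
</ul>"""

# re.DOTALL lets "." match newlines, so multi-line comments are captured.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def still_commented_out(html, keyword="URBAN", min_hits=2):
    """Return True if any single comment contains keyword at least min_hits times."""
    return any(body.count(keyword) >= min_hits
               for body in COMMENT_RE.findall(html))


# Entries still inside a comment.
print(still_commented_out(SAMPLE_HTML))
# Same markup with the comment delimiters removed, as if the entries were published.
print(still_commented_out(SAMPLE_HTML.replace("<!--", "").replace("-->", "")))
```

Keeping the check in a small function like this would also let me test it without hitting the site.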

This is my code:

import requests, re, time, smtplib

url = "http://dottorato.polito.it/Esami_accesso.html"

DEBUG = False

foundInComment = True

""" 
. matches anything but \n   
* 0 or more occurrences of the pattern to its left
() groups
? for non-greedy 
"""
regex = r"<!--(.*?)-->"

while foundInComment:
    try:
        r = requests.get(url)
        html = r.text 

        result = re.findall(regex,html,re.DOTALL) # re.DOTALL makes . match also \n 

        for match in result:
            if len(re.findall("URBAN",match)) > 1: # One of the comments has to contain URBAN at least twice
                foundInComment = True
                print("\"URBAN AND REGIONAL DEVELOPMEN\" found more than once in a comment at " 
                                       + time.strftime("%H:%M:%S"))
                break
            foundInComment = False

        time.sleep(600)

    except KeyboardInterrupt:
        raise
    except Exception as e:
        print(e)
        print("Going to sleep for 1 min")
        time.sleep(60)

if not DEBUG:
    fromaddr = '[email protected]'
    toaddrs  = ['[email protected]', fromaddr]

    msg = 'Subject: PHD polito\n\n Go to %s' % url 

    # Credentials
    username = 'someone'
    password = 'password'

    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()
    server.login(username,password)
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

    print("End of program")

So, what do you think?

Thanks in advance!

PS: this is the part of the HTML comment that contains the word URBAN:

<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN</a></li>
<li><a href="colloqui/Architettura_Storia_Progetto2.pdf">URBAN AND REGIONAL DEVELOPMEN - Cluster Tecnologie per le Smart Communities - Progetto Edifici a Zero Consumo Energetico in Distretti Urbani Intelligenti</a></li>
-->

I'm almost sure that they will copy these lines out of the comment and paste them into the visible part of the page.

Upvotes: 0

Views: 71

Answers (1)

alecxe

Reputation: 474171

An alternative (and, I think, more reliable) approach would be to use a specialized tool for the job: an HTML parser. For example, using BeautifulSoup, the following prints all comments that contain the word URBAN:

import requests
from bs4 import BeautifulSoup, Comment

url = "http://dottorato.polito.it/Esami_accesso.html"
response = requests.get(url)

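# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find_all(string=lambda text: isinstance(text, Comment) and 'URBAN' in text))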
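If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` instead. This swaps in a different parser and is only an illustration; `CommentCollector`, `comments_containing`, and the sample markup are names and data I made up for it:

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data)


def comments_containing(html, keyword="URBAN"):
    """Return the comments in html whose text contains keyword."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return [c for c in parser.comments if keyword in c]


# Hypothetical sample markup, not the real page.
sample = ("<ul><li>A</li>"
          "<!-- <li>URBAN AND REGIONAL DEVELOPMEN</li> -->"
          "<!-- something else -->"
          "</ul>")
print(comments_containing(sample))
```

An empty list would then suggest that the entries are no longer commented out.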

Upvotes: 1
