WEB SCRAPING: Regex is not returning anything? What am I doing wrong?

Question

I am trying to write a Python script that uses the "urllib" and "re" libraries to extract weather forecast information from an HTML page, but I cannot get any values returned. Could anybody help me?

import urllib
import re

url = ('http://www.metoffice.gov.uk/public/weather/forecast/gcptz5sys')

htmlfile = urllib.urlopen(url)

htmltext = htmlfile.read()

regex =('(.+?)^°C')

pattern = re.compile(regex)

temp = re.findall(pattern,htmltext)

print (temp)

I am using Python 2.7 by the way...

Khamidulla · Accepted Answer

Try this:

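#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
import re


def main():
    url = 'http://www.metoffice.gov.uk/public/weather/forecast/gcptz5sys'

    # Download the page (Python 2.7 urllib API)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()

    # Strip newlines, tabs and spaces so layout changes cannot break the regex
    htmltext = str(htmltext).replace('\n', '')
    htmltext = str(htmltext).replace('\t', '')
    htmltext = str(htmltext).replace(' ', '')

    # The named group 'temperature' captures the value in front of the unit
    pattern = re.compile('(?P<temperature>.+?)^°C')

    for match in pattern.finditer(htmltext):
        print match.group('temperature')

if __name__ == "__main__":
    main()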

Here is what I did:

Download the content.
Remove all newline characters.
Remove all tabs.
Remove all space characters.
Create and compile a regex pattern in which the named group 'temperature' captures the temperature (note: the regex does not contain whitespace or newlines).
Iterate over the matches with finditer and print each one to the console.
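
The steps above can be tried offline with a rough sketch like the one below. SAMPLE_HTML, the "temp" class, the <sup> tags and the extract_temperatures helper are all made up for illustration; the real Met Office markup may look quite different, so treat the pattern as an assumption rather than something that will match the live page.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical markup standing in for the downloaded page (an assumption).
SAMPLE_HTML = u"""
<td>
    <span class="temp">
        12 <sup>°C</sup>
    </span>
</td>
<td>
    <span class="temp">  9 <sup>°C</sup></span>
</td>
"""


def extract_temperatures(htmltext):
    # Remove newlines, tabs and spaces, as in the steps above.
    for ch in (u'\n', u'\t', u' '):
        htmltext = htmltext.replace(ch, u'')
    # Because spaces are gone, the pattern itself must not contain any:
    # '<span class=...' has become '<spanclass=...'.
    pattern = re.compile(
        u'<spanclass="temp">(?P<temperature>.+?)<sup>°C</sup>')
    return [m.group('temperature') for m in pattern.finditer(htmltext)]


print(extract_temperatures(SAMPLE_HTML))
```

Note that the pattern is written against the stripped text, which is why it contains "spanclass" with no space.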

P.S.: I removed all whitespace characters because the whitespace in the page can change dynamically on the backend, and then your regex would have to be changed every time. Removing all whitespace and newline characters avoids this problem.
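
To illustrate the point about changing whitespace, here is a small sketch with two layouts of the same made-up markup. It swaps the three replace() calls for a single re.sub(r'\s+', '', ...), which is a different way of doing the same stripping and not what the code above uses. The strip_whitespace helper, the <td>/<sup> tags and both sample strings are assumptions for the example.

```python
# -*- coding: utf-8 -*-
import re


def strip_whitespace(text):
    # Drop every run of whitespace (spaces, tabs, newlines) in one pass.
    return re.sub(r'\s+', '', text)


# The same hypothetical cell, once compact and once spread over lines.
compact = u'<td>12<sup>°C</sup></td>'
spread = u'<td>\n\t12\n  <sup> °C </sup>\n</td>'

pattern = re.compile(u'<td>(?P<temperature>.+?)<sup>°C</sup>')

for layout in (compact, spread):
    match = pattern.search(strip_whitespace(layout))
    print(match.group('temperature'))
```

Without the stripping step, the same pattern finds nothing in the spread-out layout, because "." does not match newlines by default and the extra spaces around the unit no longer line up with the pattern.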
