WEB SCRAPING: Regex is not returning anything? What am I doing wrong?

I am trying to write a Python script that uses the "urllib" and "re" libraries to extract weather forecast information from an HTML page, but I cannot get any values returned. Could anybody help me?

import urllib
import re

url = ('http://www.metoffice.gov.uk/public/weather/forecast/gcptz5sys')

htmlfile = urllib.urlopen(url)

htmltext = htmlfile.read()

regex =('<span title="Maximum daytime temperature" data-c="10" data-f="50">(.+?)<sup>°C</sup></span>')

pattern = re.compile(regex)

temp = re.findall(pattern,htmltext)

print (temp)

I am using Python 2.7 by the way...

Upvotes: 1

Answers (1)

Khamidulla

Reputation: 2975

Try this:

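#!/usr/bin/env python
# Written for Python 2.7 (urllib.urlopen does not exist in Python 3).

import urllib
import re


def main():
    url = 'http://www.metoffice.gov.uk/public/weather/forecast/gcptz5sys'

    # Download the page.
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()

    # Strip newlines, tabs and spaces so the layout of the markup cannot break the match.
    htmltext = htmltext.replace('\n', '')
    htmltext = htmltext.replace('\t', '')
    htmltext = htmltext.replace(' ', '')

    # The pattern contains no whitespace either. The data-c/data-f values are
    # matched with -?\d+ instead of fixed numbers, so any temperature matches.
    pattern = re.compile(r'<spantitle="Maximumdaytimetemperature"data-c="-?\d+"data-f="-?\d+">(?P<temperature>.+?)<sup>&deg;C</sup></span>')

    # Print the captured temperature of every match.
    for match in pattern.finditer(htmltext):
        print(match.group('temperature'))

if __name__ == "__main__":
    main()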

Here is what the script does:

  1. Download the page content.
  2. Remove all newline characters.
  3. Remove all tabs.
  4. Remove all space characters.
  5. Create and compile a regex pattern in which the named group 'temperature' captures the temperature (note: the regex itself contains no whitespace or newlines).
  6. Use the finditer function to iterate over the matches and print each one to the console.
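
If you want to try steps 2 to 6 without hitting the network, here is a minimal sketch that runs them on a small sample string. SAMPLE_HTML and extract_temperatures are made-up names for illustration, and the sample markup is only my guess at what the page looks like. I also assume the data-c/data-f values can be matched with -?\d+ so that more than one temperature is found.

```python
import re

# Hypothetical sample markup standing in for the downloaded page (step 1).
SAMPLE_HTML = (
    '<td>\n'
    '\t<span title="Maximum daytime temperature" data-c="7" data-f="45">'
    '7<sup>&deg;C</sup></span>\n'
    '</td>\n'
    '<td>\n'
    '\t<span title="Maximum daytime temperature" data-c="9" data-f="48">'
    '9<sup>&deg;C</sup></span>\n'
    '</td>\n'
)


def extract_temperatures(htmltext):
    """Return the text captured by the 'temperature' group for every match."""
    # Steps 2-4: remove newlines, tabs and spaces.
    for char in ('\n', '\t', ' '):
        htmltext = htmltext.replace(char, '')

    # Step 5: compile a pattern that contains no whitespace itself.
    # Assumption: attribute values are plain (possibly negative) integers.
    pattern = re.compile(
        r'<spantitle="Maximumdaytimetemperature"'
        r'data-c="-?\d+"data-f="-?\d+">'
        r'(?P<temperature>.+?)<sup>&deg;C</sup></span>'
    )

    # Step 6: iterate over the matches and collect the named group.
    return [match.group('temperature') for match in pattern.finditer(htmltext)]


print(extract_temperatures(SAMPLE_HTML))  # prints ['7', '9']
```

The function returns a list instead of printing inside the loop, which makes it easier to check the result before pointing it at the real page.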

P.S.: I removed all whitespace characters because the whitespace in the page can change dynamically on the backend, and you would then have to change your regex every time. Removing all whitespace and newline characters avoids this problem.
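
To illustrate the point about whitespace, here is a rough sketch showing the same pattern matching two differently laid out versions of one element. COMPACT, SPREAD, normalise and find_temperatures are hypothetical names, both snippets are invented examples rather than real page output, and I use re.sub(r'\s+', '', ...) as a shorthand for the three replace calls.

```python
import re

# Pattern without any whitespace; attribute values assumed to be integers.
PATTERN = re.compile(
    r'<spantitle="Maximumdaytimetemperature"'
    r'data-c="-?\d+"data-f="-?\d+">'
    r'(?P<temperature>.+?)<sup>&deg;C</sup></span>'
)

# Two hypothetical layouts of the same element.
COMPACT = ('<span title="Maximum daytime temperature" data-c="7" data-f="45">'
           '7<sup>&deg;C</sup></span>')
SPREAD = ('<span  title="Maximum daytime temperature"\n'
          '\tdata-c="7"\n'
          '\tdata-f="45">\n'
          '  7<sup>&deg;C</sup>\n'
          '</span>')


def normalise(htmltext):
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r'\s+', '', htmltext)


def find_temperatures(htmltext):
    """Return all 'temperature' captures found in the normalised text."""
    return [m.group('temperature') for m in PATTERN.finditer(normalise(htmltext))]


for layout in (COMPACT, SPREAD):
    print(find_temperatures(layout))  # prints ['7'] for both layouts
```

One trade-off to keep in mind: this also removes spaces inside visible text, so it suits short values like a temperature better than anything containing several words.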

Upvotes: 1
