Reputation: 55

Webscraper will not work

I have followed a tutorial pretty much to the letter. I want my scraper to collect the links to the individual pages that hold the information about each police station, but instead it returns almost the entire site.

from urllib import urlopen
import re

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

b = re.compile('<span class="listlink-police"><a href="(.*)">')
a = re.findall(b, f)

listiterator = []
listiterator[:] = range(0,16)

for i in listiterator:
    print a 
    print "\n"

f.close()

Upvotes: 0

Answers (3)

KurzedMetal

Reputation: 12946

Use BeautifulSoup

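from bs4 import BeautifulSoup
from urllib2 import urlopen

# Download the page (Python 2; on Python 3 use urllib.request.urlopen)
f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

# Name the parser explicitly so bs4 does not have to guess one
bs = BeautifulSoup(f, "html.parser")

# Each station link sits inside <span class="listlink-police">
for tag in bs.find_all('span', {'class': 'listlink-police'}):
    print tag.a['href']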

Upvotes: 7

Nix

Reputation: 58612

There are over 1.6k links with that class on the page.

I think it's working correctly... what makes you think it's not working?

And you should definitely use Beautiful Soup; it's stupid simple and extremely usable.
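One way to check whether the scraper is "working" is to count the matches and print them one per line. The sketch below is written for Python 3 and uses an invented list (`links`, with made-up paths) as a stand-in for the result of `re.findall`. It also shows why the question's output looks so large: that loop prints the whole list (`print a`) on each of its 16 passes rather than one entry per pass.

```python
# Hypothetical stand-in for the list returned by re.findall in the question;
# the paths are invented for illustration.
links = ["/station-%d.html" % n for n in range(1600)]

# How many matches were found in total
print(len(links))

# Print only the first 16 entries, one per line, instead of printing
# the entire list 16 times.
first_16 = links[:16]
for link in first_16:
    print(link)
```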

Upvotes: -1

tripleee

Reputation: 189900

You are using regex to parse HTML. You shouldn't, because you end up with just this type of problem. For a start, the .* wildcard will match as much text as it can. But once you fix that, you will pluck another fruit from the Tree of Frustration. Use a proper HTML parser instead.
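A minimal sketch of the greedy-match problem, written for Python 3 and run against an invented two-entry `SAMPLE` string rather than the real page. It compares the question's `(.*)` pattern with a non-greedy `(.*?)`, then collects the same links with the standard library's `html.parser`. The `LinkCollector` class is a hypothetical helper for illustration and assumes the spans are not nested.

```python
import re
from html.parser import HTMLParser

# Invented sample: two entries on one line, as in compact HTML.
SAMPLE = (
    '<span class="listlink-police"><a href="/a.html">A</a></span>'
    '<span class="listlink-police"><a href="/b.html">B</a></span>'
)

# Greedy: .* runs to the LAST '">' on the line, swallowing everything between.
greedy = re.findall(r'<span class="listlink-police"><a href="(.*)">', SAMPLE)

# Non-greedy: .*? stops at the FIRST '">' after each href.
lazy = re.findall(r'<span class="listlink-police"><a href="(.*?)">', SAMPLE)


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags inside <span class="listlink-police">."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "listlink-police" in classes:
            self.in_span = True
        elif tag == "a" and self.in_span and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes the target spans are not nested inside each other.
        if tag == "span":
            self.in_span = False


collector = LinkCollector()
collector.feed(SAMPLE)

print(len(greedy))      # 1: one oversized match spanning both entries
print(lazy)             # ['/a.html', '/b.html']
print(collector.links)  # ['/a.html', '/b.html']
```

The non-greedy pattern still depends on the exact attribute order and spacing in the markup, which is where the parser-based version is more forgiving.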

Upvotes: 3
