Damian Stelucir

Reputation: 55

Webscraper will not work

I followed a tutorial almost to the letter. I want my scraper to collect the links to the individual pages that hold the information about each police station, but instead it returns almost the entire site.

from urllib import urlopen
import re

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

b = re.compile('<span class="listlink-police"><a href="(.*)">')
a = re.findall(b, f)

listiterator = []
listiterator[:] = range(0,16)

for i in listiterator:
    print a 
    print "\n"

f.close()

Upvotes: 0

Views: 1054

Answers (3)

KurzedMetal

Reputation: 12946

Use BeautifulSoup

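from bs4 import BeautifulSoup
from urllib2 import urlopen

# Download the page source as a string
f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

# Name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(f, "html.parser")

# Print the href of the <a> inside each <span class="listlink-police">
for tag in bs.find_all('span', {'class': 'listlink-police'}):
    print tag.a['href']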
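
If you want to try the idea offline first, or bs4 is not installed, here is a rough standard-library sketch of the same extraction. It swaps BeautifulSoup for Python's built-in HTMLParser, so it is not the bs4 API. The PoliceLinkParser class and the sample markup are made up for illustration; only the listlink-police class name comes from the question, and I am assuming the real page nests each link directly inside that span.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class PoliceLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <span class="listlink-police">.

    Hypothetical helper: it assumes the target spans are not nested in each other.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_target_span = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "listlink-police" in classes:
            self.in_target_span = True
        elif tag == "a" and self.in_target_span and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Any closing </span> ends the region we care about.
        if tag == "span":
            self.in_target_span = False


# Made-up sample markup standing in for the real page.
sample_html = (
    '<span class="listlink-police"><a href="station-a.html">A</a></span>'
    '<span class="other"><a href="ignored.html">X</a></span>'
    '<span class="listlink-police"><a href="station-b.html">B</a></span>'
)

parser = PoliceLinkParser()
parser.feed(sample_html)
parser.close()
print(parser.links)
```

On the sample above this collects only the two links inside the listlink-police spans and skips the third. For real pages BeautifulSoup is still less work, since it copes with nesting and messy markup for you.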

Upvotes: 7

Nix

Reputation: 58612

There are over 1.6k links with that class on the page.

I think it's working correctly... what makes you think it's not working?

And you should definitely use Beautiful Soup, it's stupid simple and extremely usable.

Upvotes: -1

tripleee

Reputation: 189900

You are using regex to parse HTML. You shouldn't, because you end up with just this type of problem. For a start, the .* wildcard will match as much text as it can. But once you fix that, you will pluck another fruit from the Tree of Frustration. Use a proper HTML parser instead.
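
To make the greediness point concrete, here is a small sketch on made-up markup (the sample string and variable names are mine; only the pattern and the listlink-police class come from the question). It compares the original greedy .* with a non-greedy .*? and with a negated character class.

```python
import re

# Made-up sample: two links on a single line, as often happens in real pages.
sample_html = (
    '<span class="listlink-police"><a href="a.html">A</a></span>'
    '<span class="listlink-police"><a href="b.html">B</a></span>'
)

# Greedy: .* runs to the LAST '">' on the line, swallowing the markup in between.
greedy_links = re.findall(r'<span class="listlink-police"><a href="(.*)">', sample_html)

# Non-greedy: .*? stops at the first '">' after each href.
lazy_links = re.findall(r'<span class="listlink-police"><a href="(.*?)">', sample_html)

# Negated class: the value cannot run past its closing quote at all.
quoted_links = re.findall(r'<span class="listlink-police"><a href="([^"]*)"', sample_html)

print(len(greedy_links))  # one oversized match instead of two links
print(lazy_links)
print(quoted_links)
```

Even the fixed patterns still depend on the exact attribute order, quoting and whitespace in the markup, which is the next fruit on that tree.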

Upvotes: 3
