Reputation: 55
I have followed a tutorial pretty much to the letter. I want my scraper to collect all the links to the individual pages containing the info about each police station, but instead it returns almost the entire site.
from urllib import urlopen
import re
f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()
b = re.compile('<span class="listlink-police"><a href="(.*)">')
a = re.findall(b, f)
listiterator = []
listiterator[:] = range(0,16)
for i in listiterator:
    print a
    print "\n"
f.close()
Upvotes: 0
Views: 1054
Reputation: 12946
Use BeautifulSoup
from bs4 import BeautifulSoup
from urllib2 import urlopen
f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()
# Name the parser explicitly so bs4 does not have to guess one
bs = BeautifulSoup(f, "html.parser")
for tag in bs.find_all('span', {'class': 'listlink-police'}):
    print tag.a['href']
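If installing bs4 is not an option, roughly the same extraction can be sketched with the standard library's html.parser. This is only a minimal sketch written for Python 3: the class name PoliceLinkParser and the sample markup are made up for illustration, and it assumes the spans are not nested inside each other.

```python
from html.parser import HTMLParser


class PoliceLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <span class="listlink-police">."""

    def __init__(self):
        super().__init__()
        self.in_target_span = False  # True while inside a matching <span>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "listlink-police" in classes:
            self.in_target_span = True
        elif tag == "a" and self.in_target_span and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes no nested spans: any closing </span> ends the target span
        if tag == "span":
            self.in_target_span = False


# Hypothetical markup standing in for the real page
sample = """
<span class="listlink-police"><a href="/station-a.html">Station A</a></span>
<span class="other"><a href="/ignore.html">Ignore</a></span>
<span class="listlink-police"><a href="/station-b.html">Station B</a></span>
"""

parser = PoliceLinkParser()
parser.feed(sample)
print(parser.links)
```

On the real page you would feed it the decoded HTML you downloaded instead of the sample string.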
Upvotes: 7
Reputation: 58612
There are over 1.6k links with that class on the page.
I think it's working correctly... what makes you think it's not working?
And you should definitely use Beautiful Soup; it's stupid simple and extremely usable.
Upvotes: -1
Reputation: 189900
You are using regex to parse HTML. You shouldn't, because you end up with exactly this type of problem. For a start, the .* wildcard is greedy: it will match as much text as it can, so a single match can swallow everything up to the last "> on the line. But once you fix that, you will pluck another fruit from the Tree of Frustration. Use a proper HTML parser instead.
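To see the greedy behaviour in isolation, here is a small sketch written for Python 3. The two-link snippet is made up and stands in for one long line of the real page. It compares the pattern from the question with a non-greedy variant that uses .*? instead.

```python
import re

# Hypothetical snippet: two links on a single line, as in minified HTML
html = ('<span class="listlink-police"><a href="/a.html">A</a></span>'
        '<span class="listlink-police"><a href="/b.html">B</a></span>')

# Pattern from the question: .* grabs as much as possible
greedy = re.compile(r'<span class="listlink-police"><a href="(.*)">')
# Non-greedy variant: .*? stops at the first "> it can
lazy = re.compile(r'<span class="listlink-police"><a href="(.*?)">')

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

# One oversized match that runs from the first href to the last one
print(len(greedy_matches), greedy_matches)
# One match per link
print(lazy_matches)
```

The non-greedy pattern only fixes this one symptom. It still breaks if the attribute order changes or the markup is wrapped differently, which is why a parser is the safer choice.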
Upvotes: 3