Reputation: 67
I'm trying to learn how to do web scraping, and the output isn't coming out in the format I hoped for. Here is the issue I'm running into:
import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    storeName = re.findall(pattern, htmltext)
    print "Store Name=", storeName[i]
    i += 1
This code produces this result: Store Name= Boost Mobile store by wireless depot, Store Name= Wal-Mart, ... and so on for 10 different stores. I'm assuming this happens because

    while i < len(pagelist):

only runs ten times, so it only prints out ten of the stores instead of all the stores listed on all the pages.
When I change the second-to-last line to

    print storeName

it prints out every store name listed on each page, but not in the format above; instead it looks like this: 'Boost mobile store by wireless depot', 'boost mobile store by kob wireless', 'marietta check cashing services', ... and so on for about another 120 entries. So how do I get it in the desired format of "Store Name = ..." rather than 'name', 'name', ...?
Upvotes: 1
Views: 75
Reputation: 473753
Do not parse HTML with regex. Use a specialized tool: an HTML parser.

Here's a solution using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url))
    print "Page Number: %s" % page

    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print h2.text
It prints:
Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...
As you can see, first we find a table tag with the results class - this is where the store names actually are. Then, inside the table, we find all of the h2 tags. This is more robust than relying on the style attribute of a tag.
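For instance, on a small stand-in document (hypothetical markup, not the real stores page), scoping the search to the table keeps unrelated headings out of the results:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one stores page.
html = """
<h2>Unrelated page heading</h2>
<table class="results">
  <tr><td><h2>Boost Mobile Store by Wireless Depot</h2></td></tr>
  <tr><td><h2>Target</h2></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find('table', class_="results")  # only the results table
names = [h2.text for h2 in results.find_all('h2')]
print(names)  # the headings inside the table, nothing else
```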
You can also make use of SoupStrainer. It would improve performance, since it would parse only the part of the document that you specify:

from bs4 import SoupStrainer

required_part = SoupStrainer('table', class_="results")

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url), parse_only=required_part)
    print "Page Number: %s" % page

    for h2 in soup.find_all('h2'):
        print h2.text
Here we are saying: "parse only the table tag with the class results, and give us all of the h2 tags inside it."
Also, if you want to improve performance further, you can let BeautifulSoup use the lxml parser under the hood:

soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=required_part)
Hope that helps.
Upvotes: 2
Reputation: 75535
storeName is a list, and you need to loop through it. Currently you index into it a single time on each page, using the page number, which was probably not your intent.
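As a minimal illustration (with made-up markup), re.findall returns a list of every match on the page, so indexing it once per page picks out only a single name:

```python
import re

# Hypothetical page fragment with two store headings.
html = '<h2 style="float:left;">Store A</h2><h2 style="float:left;">Store B</h2>'

# findall returns a list of ALL captured groups, not a single string.
names = re.findall('<h2 style="float:left;">(.+?)</h2>', html)
print(names)  # ['Store A', 'Store B']

# Looping over the list prints each name in the desired format.
for name in names:
    print("Store Name=", name)
```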
Here is a correct version of your code, with the loop added.
import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    storeName = re.findall(pattern, htmltext)
    for sn in storeName:
        print "Store Name=", sn
    i += 1
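As a side note, the manual counter can also be dropped in favor of a plain for loop over pagelist. A sketch of just the loop structure, in Python 3 syntax with the network fetch left out:

```python
# Loop structure only; the urlopen/findall body from the answer goes inside.
pagelist = ["page=%d" % n for n in range(1, 11)]  # same ten pages as above

for page in pagelist:
    url = "http://www.boostmobile.com/stores/?" + page + "&zipcode=30008"
    print(url)  # here you would fetch the page and print each store name
```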
Upvotes: 1