Reputation: 67
I'm trying to learn how to do web scraping, and the output isn't coming out in the format I hoped for. Here is the issue I'm running into:
import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    storeName = re.findall(pattern, htmltext)
    print "Store Name=", storeName[i]
    i += 1
This code produces this result: Store Name= Boost Mobile store by wireless depot, Store Name= Wal-Mart, ... and so on for 10 different stores. I'm assuming this happens because

    while i < len(pagelist):

only runs ten times, so it only prints out ten of the stores instead of all the stores listed on all the pages.
When I change the second-to-last line to

    print storeName

it prints out every store name listed on each page, but not in the format above; instead it looks like this: 'Boost mobile store by wireless depot', 'boost mobile store by kob wireless', 'marietta check cashing services', ... and so on for about another 120 entries. So how do I get it in the desired format of "Store Name = ..." rather than 'name', 'name', ...?
Upvotes: 1
Views: 75
Reputation: 473753
Do not parse HTML with regex. Use a specialized tool: an HTML parser.

Here's a solution using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.boostmobile.com/stores/?page={page}&zipcode={zipcode}"
num_pages = 10
zipcode = 30008

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url))
    print "Page Number: %s" % page

    results = soup.find('table', class_="results")
    for h2 in results.find_all('h2'):
        print h2.text
It prints:
Page Number: 1
Boost Mobile Store by Wireless Depot
Boost Mobile Store by KOB Wireless
Marietta Check Cashing Services
...
Page Number: 2
Target
Wal-Mart
...
As you can see, first we find a table tag with the results class - this is where the store names actually are. Then, inside the table, we find all of the h2 tags. This is more robust than relying on the style attribute of a tag.
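For instance, on a small stand-in document (hypothetical markup, not the real stores page), scoping the search to the table keeps unrelated headings out of the results:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one stores page.
html = """
<h2>Unrelated page heading</h2>
<table class="results">
  <tr><td><h2>Boost Mobile Store by Wireless Depot</h2></td></tr>
  <tr><td><h2>Target</h2></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find('table', class_="results")  # only the results table
names = [h2.text for h2 in results.find_all('h2')]
print(names)  # the headings inside the table, nothing else
```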
You can also make use of SoupStrainer. It would improve performance, since it would parse only the part of the document that you specify:

from bs4 import SoupStrainer

required_part = SoupStrainer('table', class_="results")

for page in xrange(1, num_pages + 1):
    url = base_url.format(page=page, zipcode=zipcode)
    soup = BeautifulSoup(urllib2.urlopen(url), parse_only=required_part)
    print "Page Number: %s" % page

    for h2 in soup.find_all('h2'):
        print h2.text
Here we are saying: "parse only the table tag with the class results, and give us all of the h2 tags inside it."
Also, if you want to improve performance further, you can let BeautifulSoup use the lxml parser under the hood:

soup = BeautifulSoup(urllib2.urlopen(url), "lxml", parse_only=required_part)
Hope that helps.
Upvotes: 2
Reputation: 75535
storeName is a list, and you need to loop through it. Currently you index into it a single time on each page, using the page number, which was probably not your intent.
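As a minimal illustration (with made-up markup), re.findall returns a list of every match on the page, so indexing it once per page picks out only a single name:

```python
import re

# Hypothetical page fragment with two store headings.
html = '<h2 style="float:left;">Store A</h2><h2 style="float:left;">Store B</h2>'

# findall returns a list of ALL captured groups, not a single string.
names = re.findall('<h2 style="float:left;">(.+?)</h2>', html)
print(names)  # ['Store A', 'Store B']

# Looping over the list prints each name in the desired format.
for name in names:
    print("Store Name=", name)
```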
Here is a correct version of your code, with the loop added.
import urllib
import re

pagelist = ["page=1","page=2","page=3","page=4","page=5","page=6","page=7","page=8","page=9","page=10"]
ziplocations = ["=30008","=30009"]

i = 0
while i < len(pagelist):
    url = "http://www.boostmobile.com/stores/?" + pagelist[i] + "&zipcode=30008"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<h2 style="float:left;">(.+?)</h2>'
    pattern = re.compile(regex)
    storeName = re.findall(pattern, htmltext)
    for sn in storeName:
        print "Store Name=", sn
    i += 1
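As a side note, the manual counter can also be dropped in favor of a plain for loop over pagelist. A sketch of just the loop structure, in Python 3 syntax with the network fetch left out:

```python
# Loop structure only; the urlopen/findall body from the answer goes inside.
pagelist = ["page=%d" % n for n in range(1, 11)]  # same ten pages as above

for page in pagelist:
    url = "http://www.boostmobile.com/stores/?" + page + "&zipcode=30008"
    print(url)  # here you would fetch the page and print each store name
```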
Upvotes: 1