user3434449

Python BeautifulSoup City Scraping Complications

I am trying to scrape the countries and cities from Craigslist, and I am nearly there.

The problem is that after the first city, the output skips to the first city of the next box instead of listing the rest of the current one.

The output I am trying to achieve is:

COUNTRY   |    STATE   |   CITY
US:          ALABAMA:       AUBURN
US:          ALABAMA:       BIRMINGHAM
US:          ALABAMA:       DOTHAN

But instead I get:

COUNTRY   |    STATE   |   CITY
US:          ALABAMA:       AUBURN
US:          ALABAMA:       ANCHORAGE / MAT-SU
US:          ALABAMA:       FLAGSTAFF / SEDONA

Then, when I reach the end of the column, the STATE changes to the next STATE.

This is my current code:

from BeautifulSoup import BeautifulSoup
import urllib2


soup = BeautifulSoup(urllib2.urlopen("http://www.craigslist.org/about/sites").read())
soup.prettify()

for h1 in soup.findAll('h1'):
    colmask_div = h1.findNextSibling('div')

    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        for ul in box_div.findAll('ul'):
            print '{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text)
            raw_input()

Boxes are being skipped somewhere, but I can't find where. Thanks! This is not homework, just a personal project to learn BeautifulSoup :)

Upvotes: 3

Views: 151

Answers (1)

Jamie Cockburn

Reputation: 7555

The problem with your code is that you are:

  1. Getting the first h4 element (state name) in the column
  2. Getting all the ul elements (lists of towns) in the whole column
  3. For each list of towns, outputting the first li element (town name) only
  4. Moving on to the next list of towns, without moving on to the next state
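
The four steps above can be reproduced without any HTML. The following is a minimal sketch in Python 3 that uses plain lists as stand-ins for the parsed elements; the `column` data and the `buggy_rows` helper are illustrative assumptions, not part of the original code.

```python
# Hypothetical stand-in for one column ("box") of the page:
# each entry is (state name, list of towns), mimicking an h4 followed by a ul.
column = [
    ("Alabama", ["auburn", "birmingham", "dothan"]),
    ("Alaska", ["anchorage / mat-su", "fairbanks"]),
    ("Arizona", ["flagstaff / sedona", "phoenix"]),
]


def buggy_rows(country, column):
    """Mimic the question's loop: one state per column, one town per list."""
    first_state = column[0][0]  # step 1: only the first h4 is read
    rows = []
    for _state, towns in column:  # step 2: every ul in the column
        # step 3: only the first li of each list is used
        rows.append((country, first_state, towns[0]))
    return rows  # step 4: the state never advanced


for row in buggy_rows("US", column):
    print(" : ".join(row))
```

This prints Alabama next to the first town of each state, which is the same pattern as the unwanted output in the question.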

I'd go for something more like this:

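# `soup` is the BeautifulSoup object built in the question.
for country in soup.findAll('h1'):
    # The div following each h1 holds every column for that country.
    country_data = country.findNextSibling('div')
    # Pair the n-th h4 (state) with the n-th ul (its list of towns).
    for state, towns in zip(country_data.findAll('h4'), country_data.findAll('ul')):
        # Print every li in the list, not just the first.
        for town in towns.findAll('li'):
            print '{} : {} : {}'.format(country.text, state.text, town.text)
            raw_input()  # pause until Enter is pressed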

You don't need to process each column in turn. Here, BeautifulSoup does the work: findAll() searches recursively, so it returns all the nested h4 and ul elements in the top-level div for a country as two lists.

I then use Python's built-in zip() function to iterate over those two lists in step.
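
As a rough illustration of that pairing, here is a small Python 3 sketch with plain lists in place of the h4 and ul results; the `states` and `town_lists` values and the `rows` helper are made up for the example. One caveat worth knowing: zip() stops at the shorter input, so this approach assumes every h4 is followed by exactly one ul.

```python
# Stand-ins for findAll('h4') and findAll('ul'); the values are illustrative.
states = ["Alabama", "Alaska"]
town_lists = [["auburn", "birmingham"], ["anchorage / mat-su"]]


def rows(country, states, town_lists):
    """Pair the n-th state with the n-th town list, then expand each town."""
    return [
        (country, state, town)
        for state, towns in zip(states, town_lists)
        for town in towns
    ]


for row in rows("US", states, town_lists):
    print(" : ".join(row))

# zip() silently drops unmatched items: "Arizona" has no town list here,
# so only two pairs are produced.
unbalanced = list(zip(states + ["Arizona"], town_lists))
print(len(unbalanced))
```

If the page ever contains a state heading without a list (or the reverse), the pairs would drift out of step, so comparing the lengths of the two lists first is a cheap sanity check.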


Output

US : Alabama : auburn
US : Alabama : birmingham
US : Alabama : dothan
...
US : Alabama : tuscaloosa
US : Alaska : anchorage / mat-su
...
US : Territories : U.S. virgin islands
Canada : Alberta : calgary
...

Performance

In Python 2, you could improve the performance of this code, mainly its memory use, by using itertools.izip(). It doesn't build the whole list of pairs in memory from the two inputs, but returns an iterator that yields one pair at a time.

In Python 3, the regular zip() function does this by default.
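
A quick way to see this behaviour in Python 3 is sketched below; the lists are placeholder data. zip() returns an iterator that produces pairs on demand, and list() is only needed when all pairs must exist at once.

```python
states = ["Alabama", "Alaska", "Arizona"]
town_counts = [3, 2, 2]  # placeholder values, not real counts

pairs = zip(states, town_counts)  # no pairs have been built yet

print(isinstance(pairs, list))  # False: zip() gives an iterator, not a list

first = next(pairs)  # builds only the first pair
print(first)

remaining = list(pairs)  # explicitly materialise whatever is left
print(remaining)
```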

Upvotes: 2
