Reputation:
I am attempting to scrape the countries and cities off craigslist and i am so close.
The problem i am having is that the cities skip and go to the next box.
The output i am trying to achieve is:
COUNTRY | STATE | CITY
US: ALABAMA: AUBURN
US: ALABAMA: BIRMINGHAM
US: ALABAMA: DOTHAN
But instead i get:
COUNTRY | STATE | CITY
US: ALABAMA: AUBURN
US: ALABAMA: ANCHORAGE / MAT-SU
US: ALABAMA: FLAGSTAFF / SEDONA
Then when i reach the end of the column, the STATE changes to the next STATE.
This is my current code:
from BeautifulSoup import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.craigslist.org/about/sites").read())
soup.prettify()
for h1 in soup.findAll('h1'):
colmask_div = h1.findNextSibling('div')
for box_div in colmask_div.findAll('div'):
h4 = box_div.find('h4')
for ul in box_div.findAll('ul'):
print'{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text)
raw_input()
Skipping boxes somewhere but can't find where! Thanks. This is not homework, just a personal project to learn beautifulsoup :)
Upvotes: 3
Views: 151
Reputation: 7555
The problem with your code is that you are:
h4
element (state name) in the columnul
elements (lists of towns) in the whole columnli
element (town name) onlyI'd go for something more like this:
for country in soup.findAll('h1'):
country_data = country.findNextSibling('div')
for state, towns in zip(country_data.findAll('h4'), country_data.findAll('ul')):
for town in towns.findAll('li'):
print '{} : {} : {}'.format(country.text, state.text, town.text)
raw_input()
You don't need to process each column in turn. Here I am getting BS to do the work of fetching all the nested h4
and ul
elements in the top level div
for a country as two lists.
I then use the python zip()
function to iterate over those two lists in step.
US : Alabama : auburn
US : Alabama : birmingham
US : Alabama : dothan
...
US : Alabama : tuscaloosa
US : Alaska : anchorage / mat-su
...
US : Territories : U.S. virgin islands
Canada : Alberta : calgary
...
In Python 2, you could improve the performance of this code by using itertools.izip()
. This doesn't create the whole list of pairs of elements in memory from the two inputs, but uses a generator instead.
In Python 3, the regular zip()
function does this by default.
Upvotes: 2