user3827516

Reputation: 1

BeautifulSoup4 parsing html

I need to grab all the high school names, along with their cities, from this website using BeautifulSoup4. I added the non-working code below. Thanks so much.

http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas

import urllib2
bs4 import BeautifulSoup

opener = urllib2.build_opener()
opener.addheaders = [('User-again','Mozilla/5.0' ) ]

url = ("http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas")

ourUrl = opener.open(url).read()

soup = BeautifulSoup(ourUrl)

print get_text(soup.find_all('il')) 

![html](http://i1074.photobucket.com/albums/w402/phillipjones2/Screenshot2014-08-07at53445PM_zpsebe195cb.png)

Upvotes: 0

Views: 493

Answers (1)

MattDMo

Reputation: 102922

There are numerous errors in your program. Below is a working one that should serve as a base for additional optimization.

import requests # much better than using urllib2
from bs4 import BeautifulSoup # you forgot the `from`

# you don't need () around the URL
url = "http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas"
# requests.get() does everything at once, no need to call `opener` and `read()`
r = requests.get(url)
contents = r.text # get the HTML contents of the page

# name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(contents, "html.parser")
for item in soup.find_all('li'): # 'li' and 'il' are different things...
    # you need to iterate over all the elements found by `find_all()`;
    # get_text() is a method of each element, not a standalone function
    print(item.get_text())

And that's it. This will get you the text of every <li>...</li> item on the page. As you'll see when you run the program, there are a lot of irrelevant results, such as the table of contents, the menu items on the left side, the footer, etc. I'll leave it up to you to figure out how to get just the names of the schools, and separate out county names and other cruft.
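As a starting point for that filtering, here is a minimal sketch, not a definitive solution. It rests on assumptions about the page that you should check against the actual HTML: that the article body sits in a `<div id="mw-content-text">`, that the table of contents is wrapped in an element with class `toc`, and that each school entry is an `<li>` whose text looks like `School Name, City`. The function name `extract_schools` and the `sample` markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_schools(html):
    """Return (school, city) tuples for list items shaped like 'Name, City'."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the article body is <div id="mw-content-text">.
    # Fall back to the whole document if that div is not present.
    content = soup.find(id="mw-content-text")
    if content is None:
        content = soup

    schools = []
    for item in content.find_all("li"):
        # Assumption: table-of-contents entries sit inside class="toc".
        if item.find_parent(class_="toc") is not None:
            continue
        # Skip items that wrap a nested list; their text would mix
        # several entries together.
        if item.find("ul") is not None:
            continue
        # Split on the first comma: left side is the name, right side the city.
        name, sep, city = item.get_text().strip().partition(",")
        name, city = name.strip(), city.strip()
        if sep and name and city:
            schools.append((name, city))
    return schools


# Hypothetical markup that mimics the assumed page structure.
sample = """
<div id="mw-content-text">
  <div class="toc"><ul><li>1 Anderson County, overview</li></ul></div>
  <h2>Anderson County</h2>
  <ul>
    <li><a href="#">Cayuga High School</a>, Cayuga</li>
    <li><a href="#">Elkhart High School</a>, Elkhart</li>
    <li>See also</li>
  </ul>
</div>
<div id="footer"><ul><li>Privacy policy, terms of use</li></ul></div>
"""

for school, city in extract_schools(sample):
    print("%s (%s)" % (school, city))
```

To try it on the real page, pass `r.text` from the program above to `extract_schools()`. Entries that do not follow the `Name, City` pattern are dropped, and other comma-containing list items inside the article body may slip through, so expect to adjust the filters once you have looked at the real markup.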

For reference, have a good read through the Beautiful Soup documentation. It will answer a lot of your questions.

Upvotes: 1
