Reputation: 1
I need to grab all the high school names, along with their cities, from this website using BeautifulSoup4. I've added my non-working code below. Thanks so much.
http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas
import urllib2
bs4 import BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders = [('User-again','Mozilla/5.0' ) ]
url = ("http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas")
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
print get_text(soup.find_all('il'))
![html](http://i1074.photobucket.com/albums/w402/phillipjones2/Screenshot2014-08-07at53445PM_zpsebe195cb.png)
Upvotes: 0
Views: 493
Reputation: 102922
There are numerous errors in your program. Below is a working one that should serve as a base for additional optimization.
import requests  # much better than using urllib2
from bs4 import BeautifulSoup  # you forgot the `from`

url = "http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas"
# you don't need () around it
r = requests.get(url)
# does everything all at once, no need to call `opener` and `read()`
contents = r.text  # get the HTML contents of the page
soup = BeautifulSoup(contents, "html.parser")  # name the parser explicitly to avoid a warning
for item in soup.find_all('li'):  # 'li' and 'il' are different things...
    # you need to iterate over all the elements found by `find_all()`,
    # and `get_text()` is a method on each element, not a standalone function
    print(item.get_text())
And that's it. This will get you the text of every <li>...</li> item on the page. As you'll see when you run the program, there are a lot of irrelevant results, such as the table of contents, the menu items on the left side, the footer, etc. I'll leave it up to you to figure out how to get just the names of the schools, and separate out county names and other cruft.
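As a starting point for that filtering, here is a minimal sketch rather than a finished solution. It rests on several assumptions I have not verified against the live page: that the article body sits in a `div` with `id="mw-content-text"`, that the table of contents is wrapped in an element with `id="toc"`, and that each school entry reads like "School Name, City". The function name `extract_school_entries` is made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_school_entries(html):
    """Return (name, city) tuples for list items found in the article body."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the article body lives in <div id="mw-content-text">.
    # Fall back to the whole document if that container is missing.
    body = soup.find("div", id="mw-content-text")
    if body is None:
        body = soup

    entries = []
    for item in body.find_all("li"):
        # Assumption: the table of contents sits inside an element with id="toc".
        if item.find_parent(id="toc") is not None:
            continue
        # Skip items that only group a nested list (e.g. a heading-like entry).
        if item.find("ul") is not None:
            continue

        text = item.get_text(" ", strip=True)
        # Assumption: entries look like "School Name, City"; split on the last comma.
        name, sep, city = text.rpartition(",")
        if not sep:
            continue  # no comma, so probably not a school entry
        entries.append((name.strip(), city.strip()))
    return entries
```

You would call it as `extract_school_entries(r.text)` with the `r` from the program above, then loop over the returned tuples. Expect to adjust the selectors and the comma split once you inspect the real markup, since entries with extra commas or annotations will not split cleanly.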
For reference, have a good read through the BeautifulSoup documentation. It will answer a lot of your questions.
Upvotes: 1