Beautiful Soup element access

Question

I am trying to use BeautifulSoup to extract information from a web page. My code is here:

from bs4 import BeautifulSoup
import urllib2

# Send a browser-like User-Agent header with the request.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/wiki/American_films_of_1971')
page = infile.read()

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(page, 'html.parser')
soup.prettify().encode('utf8')
print(soup.find_all("table", "wikitable"))

Output:

[
Title
Director
Cast
Genre/Note

$ aka Dollars
Richard Brooks
Warren Beatty, Goldie Hawn
Comedy, Crime

200 Motels
Tony Palmer, Charles Swenson
Frank Zappa, Ringo Starr, Theodore Bikel
Comedy, Musical
]

I want to extract the text of each td element in each tr element, something like:

aka Dollars | Richard Brooks | Warren Beatty | Crime
200 Motels | Tony Palmer, Charles Swenson | Frank Zappa | Comedy

I am unsure how to look into the child tags after I get the part of the document I want.

I was wondering if BeautifulSoup is the right tool or if I should look at something else.

Martijn Pieters · Accepted Answer

Each result in the .find_all() list is another element object, so you can do further searches on these:

for table in soup.find_all("table", "wikitable"):
    for row in table.find_all('tr'):
        cells = []
        for cell in row.find_all('td'):
            cells.append(cell.get_text())
        print(' | '.join(cells))

This gives me:

$ aka Dollars | Richard Brooks | Warren Beatty, Goldie Hawn | Comedy, Crime | 
200 Motels | Tony Palmer, Charles Swenson | Frank Zappa, Ringo Starr, Theodore Bikel | Comedy, Musical |
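
The trailing separator in that output suggests each row ends with an empty cell, and a header row made only of th elements would print as a blank line. A minimal sketch of one way to handle both cases follows. It is not part of the original answer: the SAMPLE_HTML string and the table_rows helper are hypothetical names used for illustration, and the markup is a simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table (an assumption,
# not the real page source). Each data row ends with an empty cell.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Title</th><th>Director</th><th>Cast</th><th>Genre/Note</th></tr>
  <tr><td>$ aka Dollars</td><td>Richard Brooks</td>
      <td>Warren Beatty, Goldie Hawn</td><td>Comedy, Crime</td><td></td></tr>
  <tr><td>200 Motels</td><td>Tony Palmer, Charles Swenson</td>
      <td>Frank Zappa, Ringo Starr, Theodore Bikel</td><td>Comedy, Musical</td><td></td></tr>
</table>
"""


def table_rows(html, css_class="wikitable"):
    """Return one list of cell strings per non-empty row of each matching table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_=css_class):
        for row in table.find_all("tr"):
            # Passing a list matches header (th) and data (td) cells alike.
            cells = [cell.get_text().strip() for cell in row.find_all(["th", "td"])]
            # Drop only trailing empty cells so interior columns stay aligned.
            while cells and not cells[-1]:
                cells.pop()
            if cells:
                rows.append(cells)
    return rows


for cells in table_rows(SAMPLE_HTML):
    print(" | ".join(cells))
```

Returning lists rather than printing inside the loop keeps the extracted cells available for further processing, for example writing them out with the csv module.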
