Rob Richmond

Reputation: 875

BeautifulSoup findAll with name and text

I'm trying to match the TH tag in the below HTML (file.txt):

<TABLE WIDTH="71%" BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR VALIGN="BOTTOM">
<TH WIDTH="34%" ALIGN="LEFT"><FONT SIZE=1><B>Name<BR> </B></FONT><HR NOSHADE></TH>
<TH WIDTH="3%"><FONT SIZE=1>&nbsp;</FONT></TH>
<TH WIDTH="5%" ALIGN="CENTER"><FONT SIZE=1><B>Age</B></FONT><HR NOSHADE></TH>
<TH WIDTH="3%"><FONT SIZE=1>&nbsp;</FONT></TH>
<TH WIDTH="55%" ALIGN="CENTER"><FONT SIZE=1><B>Positions</B></FONT><HR NOSHADE></TH>
</TR>
<TR BGCOLOR="#CCEEFF" VALIGN="TOP">
<TD WIDTH="34%"><FONT SIZE=2>Stephen A. Wynn</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="5%" ALIGN="CENTER"><FONT SIZE=2>60</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="55%"><FONT SIZE=2>Chairman of the Board and Chief Executive Officer</FONT></TD>
</TR>
<TR BGCOLOR="White" VALIGN="TOP">
<TD WIDTH="34%"><FONT SIZE=2>Kazuo Okada</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="5%" ALIGN="CENTER"><FONT SIZE=2>60</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="55%"><FONT SIZE=2>Vice Chairman of the Board</FONT></TD>
</TR>
</TABLE>

I have tried the following, but it doesn't seem to work:

from bs4 import BeautifulSoup

# read the file and parse it with an explicitly named parser
with open("file.txt") as infile:
    soup = BeautifulSoup(infile.read(), "html.parser")

# this works
soup.findAll('th')
# this works but isn't particularly useful...
soup.findAll(text="Age")
# this is what I really want, but it returns an empty list
soup.findAll('th', text="Age")

Thanks for the help!

Upvotes: 0

Views: 6729

Answers (2)

Zach

Reputation: 2441

The additional <HR> element is interfering with BeautifulSoup's string processing.

From the BeautifulSoup documentation: "Although text is for finding strings, you can combine it with arguments for finding tags, Beautiful Soup will find all tags whose .string matches your value for text."

You'll find that soup.findAll('th')[2].string is None, because that th has two children (the FONT and the HR), while soup.findAll('th')[2].font.string is u"Age".
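
The difference can be checked on a cut-down copy of the "Age" header cell. This is only a minimal sketch: it assumes the html.parser backend, and the html string and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the "Age" header cell from the question (illustrative).
html = (
    '<TABLE><TR>'
    '<TH WIDTH="5%" ALIGN="CENTER"><FONT SIZE=1><B>Age</B></FONT><HR NOSHADE></TH>'
    '</TR></TABLE>'
)
soup = BeautifulSoup(html, "html.parser")

th = soup.find("th")
# <th> has two children (<font> and <hr>), so .string cannot pick one and is None.
th_string = th.string
# <font> has a single chain of children (<b> -> "Age"), so .string resolves to the text.
font_string = th.font.string
print(repr(th_string), repr(font_string))
```

Since the tag-plus-text search compares against .string, the th is skipped even though "Age" appears inside it.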

To find the required header without changing your markup, you'll have to do something like what TimD suggests:

out = []
headers = soup.findAll("th")
for header in headers:
    if header.find(text="Age"):
        out.append(header)
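
If you'd rather keep it to a single call, one possible variant is to pass a function to find_all. This is a sketch under assumptions, not something from the question: is_age_header is a hypothetical helper, the html string is a shortened copy of the header row, and it compares the cell's whitespace-stripped text instead of looking for an exact "Age" string node.

```python
from bs4 import BeautifulSoup

# Shortened copy of the header row from the question (illustrative).
html = """<TABLE><TR>
<TH><FONT SIZE=1><B>Name<BR> </B></FONT><HR NOSHADE></TH>
<TH><FONT SIZE=1><B>Age</B></FONT><HR NOSHADE></TH>
</TR></TABLE>"""
soup = BeautifulSoup(html, "html.parser")

def is_age_header(tag):
    # Hypothetical helper: True for <th> tags whose visible text,
    # with surrounding whitespace stripped, is exactly "Age".
    return tag.name == "th" and tag.get_text(strip=True) == "Age"

# find_all calls the function on every tag and keeps those returning True.
matches = soup.find_all(is_age_header)
print(matches)
```

Because this uses get_text rather than .string, the extra HR inside the th doesn't get in the way.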

Upvotes: 1

TimD

Reputation: 1381

As far as I can tell, you want to get the th object which has the text "Age". There are many ways to skin that cat, basically starting with finding all the th elements. From there you can iterate over them to find the one that contains "Age". So the code below should be useful.

out = []
foo = soup.findAll("th")
for bar in foo:
    if bar.find(text="Age"):
        out.append(bar)

Upvotes: 3
