Reputation: 875
I'm trying to match the TH tag that contains "Age" in the HTML below (file.txt):
<TABLE WIDTH="71%" BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR VALIGN="BOTTOM">
<TH WIDTH="34%" ALIGN="LEFT"><FONT SIZE=1><B>Name<BR> </B></FONT><HR NOSHADE></TH>
<TH WIDTH="3%"><FONT SIZE=1> </FONT></TH>
<TH WIDTH="5%" ALIGN="CENTER"><FONT SIZE=1><B>Age</B></FONT><HR NOSHADE></TH>
<TH WIDTH="3%"><FONT SIZE=1> </FONT></TH>
<TH WIDTH="55%" ALIGN="CENTER"><FONT SIZE=1><B>Positions</B></FONT><HR NOSHADE></TH>
</TR>
<TR BGCOLOR="#CCEEFF" VALIGN="TOP">
<TD WIDTH="34%"><FONT SIZE=2>Stephen A. Wynn</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2> </FONT></TD>
<TD WIDTH="5%" ALIGN="CENTER"><FONT SIZE=2>60</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2> </FONT></TD>
<TD WIDTH="55%"><FONT SIZE=2>Chairman of the Board and Chief Executive Officer</FONT></TD>
</TR>
<TR BGCOLOR="White" VALIGN="TOP">
<TD WIDTH="34%"><FONT SIZE=2>Kazuo Okada</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2> </FONT></TD>
<TD WIDTH="5%" ALIGN="CENTER"><FONT SIZE=2>60</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2> </FONT></TD>
<TD WIDTH="55%"><FONT SIZE=2>Vice Chairman of the Board</FONT></TD>
</TR>
</TABLE>
I have tried the following, but it doesn't seem to work:
from bs4 import BeautifulSoup
infile = open("file.txt")
soup = BeautifulSoup(infile.read())
#this works
soup.findAll('th')
#this works but isn't particularly useful...
soup.findAll(text="Age")
#this is what I really want, but it returns an empty list
soup.findAll('th', text="Age")
Thanks for the help!
Upvotes: 0
Views: 6729
Reputation: 2441
The additional <HR> element is interfering with BeautifulSoup's string processing.
From the BeautifulSoup documentation: "Although text is for finding strings, you can combine it with arguments for finding tags, Beautiful Soup will find all tags whose .string matches your value for text."
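You'll find that soup.findAll('th')[2].string is None, while soup.findAll('th')[2].font.string is u"Age". That TH has two children (the FONT and the HR), and .string is only defined when a tag has a single child.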
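A minimal sketch of that behaviour, assuming a cut-down version of the header cell from the question and the standard-library "html.parser" backend (the question does not name a parser, so that choice is an assumption):

```python
from bs4 import BeautifulSoup

# Cut-down copy of the "Age" header cell from the question.
html = '<table><tr><th><font size="1"><b>Age</b></font><hr noshade></th></tr></table>'
soup = BeautifulSoup(html, "html.parser")
th = soup.find("th")

# <th> has two children (<font> and <hr>), so .string is None.
th_string = th.string

# <font> has a single child (<b>), which in turn has a single string,
# so .string resolves down to that string.
font_string = th.font.string

print(repr(th_string), repr(font_string))
```

Because the tag-plus-text form of findAll compares against .string, the TH never matches as long as the HR sits next to the FONT.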
To find the required header without changing your markup, you'll have to do something like what TimD suggests:
out = []
headers = soup.findAll("th")
for header in headers:
    # keep the header if any string inside it is exactly "Age"
    if header.find(text="Age"):
        out.append(header)
Upvotes: 1
Reputation: 1381
As far as I can tell, you want to get the th element that contains the text "Age". There are many ways to skin that cat, but they basically all start with finding all the th elements. From there you can iterate over them and keep the ones that contain "Age". The code below does that:
out = []
foo = soup.findAll("th")
for bar in foo:
    # search the th's descendants for a string that is exactly "Age"
    if bar.find(text="Age"):
        out.append(bar)
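The same filtering can probably be done in a single call by passing a function to find_all (the bs4 spelling of findAll). This is only a sketch: is_age_header is a hypothetical helper name, the markup is a shortened copy of the header row from the question, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Shortened copy of the header row from the question.
html = """<table><tr>
<th><font size="1"><b>Name<br> </b></font><hr noshade></th>
<th><font size="1"><b>Age</b></font><hr noshade></th>
<th><font size="1"><b>Positions</b></font><hr noshade></th>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")


def is_age_header(tag):
    """Hypothetical helper: True for a <th> whose visible text is 'Age'."""
    return tag.name == "th" and tag.get_text(strip=True) == "Age"


# find_all calls the function on every tag and keeps those returning True.
matches = soup.find_all(is_age_header)
print(matches)
```

Note that this compares the whole visible text of the cell, whereas the loop above looks for any descendant string equal to "Age"; for this table the two should pick out the same header.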
Upvotes: 3