Reputation: 63
I am new to programming and have limited knowledge of both Python and HTML. I am trying to run a web-crawling Python program that extracts specific names from some HTML pages.
Suppose I have this HTML at some URL:
<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman" SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000" ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24, 2005</FONT></TD></TR>
which renders like this:
/s/ ROBERT F. MANGANO
President, Chief Executive Officer and Director (Principal Executive Officer) March 24, 2005
I want to extract the person's name and title. So, in Python, I have written this:
def htmlParser(self):
    pageTree = html.fromstring(self.pageContent)
    print "page parsed!"
    tdTexts = pageTree.xpath("//td/descendant::*/text()")
    cleanTexts = [eachText.strip() for eachText in tdTexts if eachText.strip()]
    for i in range(1, len(cleanTexts)):
        if ('/s/' in cleanTexts[i] and (i+1) < len(cleanTexts)):
            title = []
            title = [cleanTexts[i+1] for eachKeyword in titleKeywords if eachKeyword in cleanTexts[i+1].lower()]
            if (title):
                print title
                self.boards.append([self.pageURL, cleanTexts[i].replace('/s/', ''), cleanTexts[i+1]])
                print self.boards
            elif (i+2) < len(cleanTexts):
                title = [cleanTexts[i+2] for eachKeyword in titleKeywords if eachKeyword in cleanTexts[i+2].lower()]
                if (title):
                    self.boards.append([self.pageURL, cleanTexts[i].replace('/s/', ''), cleanTexts[i+2]])
The only pattern I have found is the /s/ that recurs across the forms, so I am going to stick with that. The code above works perfectly for this form and gives me this:
;ROBERT F. MANGANO;President and Chief Executive Officer
Now I am facing this other form:
</TR>
<TR VALIGN="TOP">
<TD WIDTH="40%" ALIGN="CENTER" VALIGN="CENTER"><FONT SIZE=2>/s/ </FONT><FONT SIZE=2>JONATHAN C. COON</FONT><FONT SIZE=2> </FONT><HR NOSHADE> <FONT SIZE=2> Jonathan C. Coon</FONT></TD>
<TD WIDTH="3%" VALIGN="CENTER"><FONT SIZE=2> </FONT></TD>
<TD WIDTH="58%" VALIGN="CENTER"><FONT SIZE=2>Chief Executive Officer and Director (principal executive officer)</FONT></TD>
</TR>
which looks like:
/s/ JONATHAN C. COON Jonathan C. Coon Chief Executive Officer and Director (principal executive officer)
It is basically the same, but it has extra "&nbsp;" entities and FONT tags between the /s/ and the name (in the previous form, the /s/ is followed directly by the name). I do not know much HTML, so this is the only difference I can see between the two snippets. If there are other differences, please let me know.
I assumed my code would work for this form too, because I use "//td/descendant::*/text()" to drop all the HTML tags and look only at the text. However, when I run the code on the latter HTML, it gives me: ; ;Chief Executive Officer
As you can see, in this case it fails to catch the name. I cannot figure out how to alter the code to cover both cases, and because of my limited HTML knowledge I have not been able to search effectively for a solution.
Can anyone please help me modify the code so that it catches the name in both cases?
Thanks a lot.
P.S.: Sorry if I am not explaining this well. As I said, I am not a pro! Please let me know if anything is missing from my question.
Upvotes: 0
Views: 167
Reputation: 180441
Use BeautifulSoup to parse the HTML and read the text of each td as a whole:
from bs4 import BeautifulSoup
html = """
<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman" SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000" ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24, 2005</FONT></TD></TR>
"""
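# Name the parser explicitly so bs4 does not pick one implicitly (and warn about it)
soup = BeautifulSoup(html, "html.parser")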
print("\n".join([x.text.strip() for x in soup.find_all("td")]))
/s/ ROBERT F. MANGANO
President, Chief Executive Officer and Director (Principal Executive Officer)
March 24, 2005
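The per-cell idea above can be pushed a bit further to cover the second form as well. What follows is only a rough sketch, not a tested solution for real filings: it swaps BeautifulSoup for the standard library's html.parser (so it runs without extra installs, on Python 3), and it assumes that the signature always sits before the first HR inside its cell and that the title is the next cell containing any text. The names CellTextParser and extract_signatures are made up for this illustration, and the placement of the &nbsp; entities in the sample markup is a guess based on the question's description.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of each <td>, split into parts at every <hr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp; into '\xa0'
        self.cells = []          # one list of text parts per <td>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        if tag == "td":
            self.cells.append([""])
            self._in_cell = True
        elif tag == "hr" and self._in_cell:
            self.cells[-1].append("")   # text after the rule goes in a new part

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1][-1] += data


def extract_signatures(markup):
    """Return (name, title) pairs for every cell whose first part holds '/s/'."""
    parser = CellTextParser()
    parser.feed(markup)
    parser.close()
    # str.split() treats non-breaking spaces as whitespace, so this also
    # collapses the &nbsp; padding between the FONT tags.
    cells = [[" ".join(part.split()) for part in parts] for parts in parser.cells]
    results = []
    for i, parts in enumerate(cells):
        if "/s/" not in parts[0]:
            continue
        name = parts[0].split("/s/", 1)[1].strip()
        title = ""
        for later in cells[i + 1:]:      # assumption: title is the next non-empty cell
            text = " ".join(p for p in later if p)
            if text:
                title = text
                break
        results.append((name, title))
    return results


FORM_1 = """
<TR>
<TD VALIGN="top"> <P><FONT SIZE="2">/s/&nbsp;ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P><FONT SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P><FONT
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
</TR>
"""

FORM_2 = """
<TR VALIGN="TOP">
<TD WIDTH="40%"><FONT SIZE=2>/s/&nbsp;</FONT><FONT SIZE=2>JONATHAN C. COON</FONT><FONT SIZE=2>&nbsp;&nbsp;</FONT><HR NOSHADE> <FONT SIZE=2>&nbsp;Jonathan C. Coon</FONT></TD>
<TD WIDTH="3%"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="58%"><FONT SIZE=2>Chief Executive Officer and Director (principal executive officer)</FONT></TD>
</TR>
"""

if __name__ == "__main__":
    for form in (FORM_1, FORM_2):
        print(extract_signatures(form))
```

The reason the original XPath approach trips on the second form is that "/s/" and the name live in separate FONT elements, so they come back as separate text nodes. Gathering text per cell first sidesteps that, and cutting at the HR keeps the typed name under the rule out of the signature. You would still want your titleKeywords check on top of this if some pages put other cells between the signature and the title.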
Upvotes: 2