MarcusAerlius

Reputation: 63

Gathering information from html files with Python

I am new to programming and have limited knowledge of both Python and HTML. I am trying to run a web-crawling Python program to extract some specific names from a set of HTML pages.

Suppose that I have this HTML at some URL:

<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman"           SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000"  ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New   Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24,  2005</FONT></TD></TR>

which renders like this:

/s/ ROBERT F. MANGANO

   President, Chief Executive Officer and Director

(Principal Executive Officer)

  March 24, 2005

I want to extract the person's name and title. In Python, I have written this:

def htmlParser(self):
    pageTree = html.fromstring(self.pageContent)
    print "page parsed!"
    tdTexts =  pageTree.xpath("//td/descendant::*/text()")
    cleanTexts = [eachText.strip() for eachText in tdTexts if eachText.strip()]
    for i in range(1,len(cleanTexts)):
        if ('/s/' in cleanTexts[i] and (i+1) < len(cleanTexts)):
            title = []
            title = [cleanTexts [i+1] for eachKeyword in titleKeywords if eachKeyword in cleanTexts [i+1].lower()]
            if (title):
                print title
                self.boards.append([self.pageURL,cleanTexts[i].replace('/s/',''),cleanTexts [i+1]])
                print self.boards
            elif (i+2) < len(cleanTexts):
                title = [cleanTexts [i+2] for eachKeyword in titleKeywords if eachKeyword in cleanTexts [i+2].lower()]
                if (title):
                    self.boards.append([self.pageURL,cleanTexts[i].replace('/s/',''),cleanTexts [i+2]])

The only pattern I have found is the /s/ that recurs across the forms, so I am going to stick with that. The above code works perfectly for me and gives me this:

;ROBERT F. MANGANO;President and Chief Executive Officer

Now, I am facing this other form:

</TR>
<TR VALIGN="TOP">
<TD WIDTH="40%" ALIGN="CENTER" VALIGN="CENTER"><FONT SIZE=2>/s/&nbsp;&nbsp;</FONT><FONT     SIZE=2>JONATHAN C. COON</FONT><FONT SIZE=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</FONT><HR NOSHADE>    <FONT SIZE=2> Jonathan C. Coon</FONT></TD>
<TD WIDTH="3%" VALIGN="CENTER"><FONT SIZE=2>&nbsp;</FONT></TD>
<TD WIDTH="58%" VALIGN="CENTER"><FONT SIZE=2>Chief Executive Officer and Director (principal    executive officer)</FONT></TD>
 </TR>

which looks like:

/s/  JONATHAN C. COON       Jonathan C. Coon   Chief Executive Officer and Director (principal executive officer)

It is essentially the same, but it has &nbsp;&nbsp; and FONT tags between the /s/ and the name (in the previous form, the /s/ is followed directly by the name). I do not know much HTML, so this is the only difference I can see between the two snippets. If there are other differences, please let me know.

I assumed my code would work for this form as well, because I use "//td/descendant::*/text()" to strip out all the HTML tags and look only at the text. However, when I run the code on the latter HTML, it gives me: ; ;Chief Executive Officer

As you can see, in this case it fails to capture the name. I cannot figure out how to alter the code to cover both cases, and because of my limited knowledge of HTML, I was not able to search effectively for a solution.

Can anyone please help me modify the code so that it captures the name in both cases?

Thanks a lot.

P.S.: Sorry if I am not explaining this well. As I said, I am not a pro! Please let me know if my question is missing any explanation.

Upvotes: 0

Views: 167

Answers (1)

Padraic Cunningham

Reputation: 180441

Use BeautifulSoup to parse the HTML:

from bs4 import BeautifulSoup

html = """
<TR>
<TD VALIGN="top"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New Roman"           SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR WIDTH="91%" SIZE="1" NOSHADE COLOR="#000000"  ALIGN="left"></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2"> <P STYLE="margin-top:0px;margin-bottom:0px"><FONT FACE="Times New   Roman" SIZE="2">President, Chief Executive Officer and Director</FONT></P> <P STYLE="margin- top:0px;margin-bottom:1px"><FONT FACE="Times New Roman"
SIZE="2">(Principal Executive Officer)</FONT></P></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;</FONT></TD>
<TD VALIGN="top" ROWSPAN="2" ALIGN="center"><FONT FACE="Times New Roman" SIZE="2">March 24,  2005</FONT></TD></TR>
"""

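# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")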

print("\n".join([x.text.strip() for x in soup.find_all("td")]))

/s/ ROBERT F. MANGANO

President, Chief Executive Officer and Director (Principal Executive Officer)

March 24,  2005
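
The second form from the question can probably be handled with the same per-cell idea. In that form, /s/&nbsp;&nbsp; sits in its own FONT element, so an XPath that returns individual text nodes yields "/s/" and the name as two separate strings, and stripping "/s/" from the first one leaves nothing. Collecting the text of each TD as a whole avoids that. Below is a minimal sketch of this approach, not a definitive implementation. It swaps in the standard library's html.parser so that it runs without extra installs. RowTextParser, parse_rows, extract_signers and TITLE_KEYWORDS are hypothetical names, and the keyword list is an assumption because titleKeywords is not shown in the question. The two samples are the question's rows with most attributes trimmed. Treating the text before the HR as the signature is also an assumption based on these two forms only.

```python
from html.parser import HTMLParser

# Assumed keyword list; the question's titleKeywords is not shown.
TITLE_KEYWORDS = ("officer", "director", "president")


class RowTextParser(HTMLParser):
    """Collect cell text per <tr>; an <hr> inside a cell starts a new segment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as "\xa0"
        self.rows = []     # finished rows, each a list of cells
        self._row = None   # cells of the row currently open
        self._cell = None  # raw text segments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = [""]
        elif tag == "hr" and self._cell is not None:
            self._cell.append("")  # text after the rule is a separate segment

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[-1] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # split() treats "\xa0" as whitespace, so this also removes &nbsp;.
            segments = [" ".join(s.split()) for s in self._cell]
            self._row.append([s for s in segments if s])
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html_text):
    parser = RowTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def extract_signers(rows, keywords=TITLE_KEYWORDS):
    """Return (name, title) pairs for rows that contain a cell starting with /s/."""
    signers = []
    for row in rows:
        cells = [cell for cell in row if cell]  # drop empty spacer cells
        for i, cell in enumerate(cells):
            if not cell[0].startswith("/s/"):
                continue
            # Only the segment before any <hr> is used as the signature.
            name = cell[0][len("/s/"):].strip()
            for later in cells[i + 1:]:
                title = " ".join(later)
                if any(k in title.lower() for k in keywords):
                    signers.append((name, title))
                    break
            break  # at most one signature per row in this sketch
    return signers


FORM_1 = """<TR>
<TD VALIGN="top"> <P><FONT SIZE="2">/s/ ROBERT F. MANGANO</FONT></P><HR NOSHADE></TD>
<TD VALIGN="bottom"><FONT SIZE="1">&nbsp;&nbsp;</FONT></TD>
<TD VALIGN="top"> <P><FONT SIZE="2">President, Chief Executive Officer and Director</FONT></P>
<P><FONT SIZE="2">(Principal Executive Officer)</FONT></P></TD>
</TR>"""

FORM_2 = """<TR VALIGN="TOP">
<TD><FONT SIZE=2>/s/&nbsp;&nbsp;</FONT><FONT SIZE=2>JONATHAN C. COON</FONT><FONT SIZE=2>&nbsp;&nbsp;</FONT><HR NOSHADE> <FONT SIZE=2> Jonathan C. Coon</FONT></TD>
<TD><FONT SIZE=2>&nbsp;</FONT></TD>
<TD><FONT SIZE=2>Chief Executive Officer and Director (principal    executive officer)</FONT></TD>
</TR>"""

for form in (FORM_1, FORM_2):
    print(extract_signers(parse_rows(form)))
```

The same grouping can be done with BeautifulSoup by calling find_all("tr") and reading the text of each td in the row. The key point is to work cell by cell instead of text node by text node, so that a name split across several FONT tags still ends up in one string.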

Upvotes: 2
