Gathering information from html files with Python

Question

So, I am a newbie in programming with limited knowledge of both Python and HTML. What I am trying to do is run a web-crawling Python program to extract some specific names from some HTML pages.

Suppose that I have this HTML code at some URL:


 /s/ ROBERT F. MANGANO

  
 President, Chief Executive Officer and Director
 (Principal Executive Officer)
 
March 24,  2005

which renders like this:

/s/ ROBERT F. MANGANO
President, Chief Executive Officer and Director
(Principal Executive Officer)
March 24, 2005

I want to extract the name and the title of the person. So, in Python, I have written this:

from lxml import html  # assumed source of html.fromstring (lxml.html)

def htmlParser(self):
    pageTree = html.fromstring(self.pageContent)
    print("page parsed!")
    # Text nodes of every element nested inside a <td>, in document order
    tdTexts = pageTree.xpath("//td/descendant::*/text()")
    cleanTexts = [eachText.strip() for eachText in tdTexts if eachText.strip()]
    # Starts at 1, so the first text node is never tested for '/s/'
    for i in range(1, len(cleanTexts)):
        if '/s/' in cleanTexts[i] and (i + 1) < len(cleanTexts):
            # titleKeywords is defined elsewhere (not shown here)
            title = [cleanTexts[i + 1] for eachKeyword in titleKeywords if eachKeyword in cleanTexts[i + 1].lower()]
            if title:
                print(title)
                self.boards.append([self.pageURL, cleanTexts[i].replace('/s/', ''), cleanTexts[i + 1]])
                print(self.boards)
            elif (i + 2) < len(cleanTexts):
                # Otherwise try the text node two positions after the '/s/' line
                title = [cleanTexts[i + 2] for eachKeyword in titleKeywords if eachKeyword in cleanTexts[i + 2].lower()]
                if title:
                    self.boards.append([self.pageURL, cleanTexts[i].replace('/s/', ''), cleanTexts[i + 2]])

The only pattern that I have found is the /s/ that recurs across the forms, so I am going to stick to that. The above code works perfectly for me and gives me this:

;ROBERT F. MANGANO;President and Chief Executive Officer

Now, I am facing this other form:



/s/  JONATHAN C. COON      
     Jonathan C. Coon
 
Chief Executive Officer and Director (principal    executive officer)

which looks like:

/s/ JONATHAN C. COON Jonathan C. Coon Chief Executive Officer and Director (principal executive officer)

It is basically the same, but it has this "&nbsp;&nbsp;" and FONT stuff between the /s/ and the name (in the previous form, the /s/ is followed directly by the name). I do not know much HTML, so this is the only difference I can spot between the two. If there are other differences, please let me know.

I assumed that my code would work the same way for this form too, because I use "//td/descendant::*/text()" to strip out all the HTML tags and look only at the text. However, when I run the code on the latter HTML, it gives me: ; ;Chief Executive Officer
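A small sketch of what that XPath expression returns may help to narrow this down. The original markup of both forms is not shown above, so the two snippets below are hypothetical reconstructions: the first keeps /s/ and the name in one element, the second puts /s/ plus non-breaking spaces in one FONT element and the name in another. The helper name clean_td_texts is also made up for illustration.

```python
from lxml import html  # third-party package: pip install lxml

XPATH = "//td/descendant::*/text()"  # same expression as in the question

def clean_td_texts(markup):
    """Return the stripped, non-empty text nodes selected by XPATH."""
    tree = html.fromstring(markup)
    return [t.strip() for t in tree.xpath(XPATH) if t.strip()]

# Hypothetical markup: '/s/' and the name share a single text node.
form_one = "<table><tr><td><p>/s/ ROBERT F. MANGANO</p></td></tr></table>"

# Hypothetical markup: '/s/' and the name live in separate FONT elements.
form_two = (
    "<table><tr><td><p><font>/s/&nbsp;&nbsp;</font>"
    "<font><u>JONATHAN C. COON</u></font></p></td></tr></table>"
)

print(clean_td_texts(form_one))
print(clean_td_texts(form_two))
```

If the real page is structured like the second snippet, text() yields one string per text node, so /s/ and the name arrive as two separate list entries. In that case cleanTexts[i].replace('/s/', '') is empty, which would match the blank name in the output.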

As you can see, in this case it is unable to catch the name. I cannot figure out how I should alter the code to cover both cases, and because of my limited knowledge of HTML, I was not able to search effectively for a solution.

Can anyone please help me modify the code so that it catches both names?

Thanks a lot.

P.S.: Sorry if I am not explaining this well. As I said, I am not a pro! Please let me know if some explanation is missing from my question.
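One possible way to cover both forms, sketched under assumptions: work on the already cleaned list of strings, and when an entry contains nothing but /s/, take the name from the next entry instead. The function name extract_signers, the lookahead parameter, and the keyword list in the usage example are hypothetical; the real titleKeywords list is not shown in the question.

```python
def extract_signers(clean_texts, title_keywords, lookahead=3):
    """Return (name, title) pairs found after each '/s/' marker.

    clean_texts: stripped, non-empty text nodes in document order.
    title_keywords: lowercase words that identify a title line.
    lookahead: how many following entries to inspect (an assumption).
    """
    signers = []
    for i, text in enumerate(clean_texts):
        if '/s/' not in text:
            continue
        name = text.replace('/s/', '').strip()

        # Collect the entries after the marker, stopping at the next signer.
        following = []
        for candidate in clean_texts[i + 1:i + 1 + lookahead]:
            if '/s/' in candidate:
                break
            following.append(candidate)

        # '/s/' stood alone in its text node: the name is the next entry.
        if not name and following:
            name = following[0]
            following = following[1:]

        # The title is the first remaining entry containing a keyword.
        title = next(
            (t for t in following
             if any(k in t.lower() for k in title_keywords)),
            None,
        )
        if title is not None:
            signers.append((name, title))
    return signers


# Usage with a made-up keyword list and the text of the second form.
keywords = ["president", "chief", "officer", "director"]
texts = [
    "/s/",
    "JONATHAN C. COON",
    "Jonathan C. Coon",
    "Chief Executive Officer and Director (principal executive officer)",
]
print(extract_signers(texts, keywords))
```

Keeping this step separate from the lxml parsing makes it easy to test against plain lists of strings. It only handles the two layouts described here, so other forms may need further cases.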
