Reputation: 479
I have a lot of HTML files that look like this:
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
Summary:
</b>
According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
SIC Code:
</b>
0000
<br />
<b>
Sector:
</b>
N/A
<br />
<b>
Industry:
</b>
N/A
<br />
</font>
What I want to do is take out the text in the middle of the file and transform it into a human-readable format. In this example, it is:
According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
I know I have to do 3 things:
1. extract the summary text from the middle of the file
2. replace "<br />" with "\n"
3. replace runs of spaces with " " (one space)
I know the latter 2 things are easy, just using the replace method in Python, but I don't know how to achieve the first goal.
I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.
Can someone help me?
Thanks, and I'm sorry for my poor English.
@Paul: I want just one section, which is the summary. My teacher (who doesn't know much about computers) gave me a lot of HTML files and asked me to transform them into a format suitable for data mining (my teacher is trying to use SAS for this). I don't know SAS, but I think it can be used to handle a lot of txt files, so I want to transform these HTML files into plain txt files.
@Owen: I need to handle a lot of html files and I think this problem isn't too difficult to handle, so I want to solve it directly with Python.
Upvotes: 3
Views: 2571
Reputation: 2783
You can use Scrapely.
Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
http://github.com/scrapy/scrapely
Upvotes: 3
Reputation: 123831
The nearest thing would be to convert the HTML to reStructuredText. You can try it online here, which outputs the following.
**Summary:** According to the complaint filed January 04, 2011, over a
six-week period in December 2007 and January 2008, six healthcare
related hedge funds managed by Defendant FrontPoint Partners LLC
(“FrontPoint”) sold more than six million shares of Human Genome
Sciences, Inc. (“HGSI”) common stock while their portfolio manager
possessed material negative non-public information concerning the HGSI’s
clinical trial for the drug Albumin Interferon Alfa 2-a.
On March 2, 2011, the plaintiffs filed a First Amended Class Action
Complaint, amending the named defendants and securities violations. On
March 22, 2011, a motion for appointment as lead plaintiff and for
approval of selection of lead counsel was filed. The defendants
responded to the First Amended Complaint by filing a motion to dismiss
on March 28, 2011.
--------------
INDUSTRY CLASSIFICATION:
**SIC Code:** 0000
**Sector:** N/A
**Industry:** N/A
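If you go the conversion route, the summary still has to be cut out of the converted text. A minimal sketch of that step, assuming the converter puts a blank line between paragraphs and renders the <hr> as a line of dashes (the function name summary_from_converted is made up for illustration):

```python
import re


def summary_from_converted(text):
    """Return the text between '**Summary:**' and the first line of dashes."""
    match = re.search(
        r"\*\*Summary:\*\*(.*?)^-{3,}\s*$",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if match is None:
        return None
    body = match.group(1).strip()
    # Paragraphs are assumed to be separated by blank lines
    paragraphs = re.split(r"\n\s*\n", body)
    # Undo the converter's hard line wrapping inside each paragraph
    return "\n".join(" ".join(p.split()) for p in paragraphs)
```

Each paragraph ends up on a single line, which is the layout asked for in the question.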
Upvotes: 1
Reputation: 7674
To accomplish this task, you can use a Python library called lxml.
Try running the following code:
from lxml.html import fromstring
html = '''
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
Summary:
</b>
According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
SIC Code:
</b>
0000
<br />
<b>
Sector:
</b>
N/A
<br />
<b>
Industry:
</b>
N/A
<br />
</font>
'''
htmlElement = fromstring(html)
textContent = htmlElement.text_content()
# Keep only what lies between the two markers, then trim surrounding whitespace
result = textContent.split('Summary:')[1].split('INDUSTRY CLASSIFICATION:')[0].strip()
print(result)
This code works as long as 'Summary:' comes before the desired text and 'INDUSTRY CLASSIFICATION:' comes after it; the whitespace around the markers does not matter.
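If lxml is not installed, the same marker-based idea can be sketched with only the standard library. This is a rough sketch, not a drop-in replacement: the class name TextWithBreaks and the function extract_summary are made up for illustration, and it assumes every file uses the same two marker strings.

```python
import re
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect text nodes and turn each <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <br /> tags are routed here as well
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Source newlines and repeated spaces become a single space
        self.parts.append(re.sub(r"\s+", " ", data))


def extract_summary(html, start="Summary:", end="INDUSTRY CLASSIFICATION:"):
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Keep only the part between the two markers
    body = text.split(start, 1)[1].split(end, 1)[0]
    # Drop the empty lines left behind by consecutive <br /> tags
    lines = [re.sub(r" +", " ", line).strip() for line in body.split("\n")]
    return "\n".join(line for line in lines if line)
```

Calling extract_summary(html) on the string above should give the two paragraphs, one per line. To process many files, loop over them, read each one, and write the returned string to a .txt file.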
Upvotes: 2