Reputation: 2065
I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.
I would like to scrape all of this data into Stata (or, alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and the details of their profit & loss statements (revenue, profit, etc.).
May I know how this can be done? I tried Outwit but it doesn't seem good enough.
Upvotes: 0
Views: 2235
Reputation: 357
You should use the Python BeautifulSoup package. It is very handy for extracting data from HTML files. Here is the link:
http://www.crummy.com/software/BeautifulSoup/
The documentation lists many commands, but only a few of them are needed here. The important ones are:
from bs4 import BeautifulSoup
# read the file
with open(file_name, 'r') as fp:
    data = fp.read()
# pass the data to BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
# extract the HTML elements by id and write the result to a file
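The last step above is left as a comment, so here is a minimal sketch of what it might look like. The table id "pnl", the sample HTML and the helper names table_rows and write_rows are assumptions made for illustration; the real files may identify their tables differently.

```python
import csv
from bs4 import BeautifulSoup

def table_rows(html, table_id=None):
    """Return the cell text of every row in the first matching <table>."""
    soup = BeautifulSoup(html, "html.parser")
    # Look the table up by id when one is given, else take the first table.
    table = soup.find("table", id=table_id) if table_id else soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

def write_rows(rows, csv_path):
    """Write the extracted rows to a CSV file (hypothetical output path)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        csv.writer(out).writerows(rows)

# Made-up sample standing in for one of the 40,000 files.
sample = """<table id="pnl">
<tr><th>Item</th><th>Amount</th></tr>
<tr><td>Revenue</td><td>1000</td></tr>
<tr><td>Profit</td><td>150</td></tr>
</table>"""
rows = table_rows(sample, "pnl")
```

Calling write_rows(rows, "company.csv") would then save the rows; if the tables have no id, passing no table_id falls back to the first table in the file.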
Upvotes: 1
Reputation:
Stata is not the right tool for this job, although in principle it is possible. I have done similar things myself: reading ASCII files into Stata, parsing them and extracting information from them. I loaded the data into Stata using insheet and then processed it with Stata's string functions. It was a bit cumbersome, and those files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.
I think the best strategy is to use a scripting language such as Python, Perl or Ruby to extract the information contained in the HTML tables. The results can then easily be written to a CSV, Excel or even a Stata (.dta) file.
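As a rough sketch of that strategy in Python, using only the standard library: the code below assumes that each file holds one table whose rows are label/value pairs, and that the file name (without .html) can stand in for the company name. Both are assumptions, and TableCells, statement_to_record and combine are hypothetical names.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path

class TableCells(HTMLParser):
    """Collect every table row as a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None

def statement_to_record(html):
    """Map each row's first cell (label) to its second cell (value)."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) >= 2}

def combine(html_dir, csv_path):
    """Write one CSV row per HTML file; return the number of files read."""
    records = []
    for path in sorted(Path(html_dir).glob("*.html")):
        record = {"company": path.stem}  # assumption: file name = company
        text = path.read_text(encoding="utf-8", errors="replace")
        record.update(statement_to_record(text))
        records.append(record)
    # Use the union of all labels, in first-seen order, as the columns.
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

A call such as combine("statements", "all_companies.csv") (both paths made up) would produce one wide CSV, which Excel can open directly and Stata can read with insheet or import delimited. Values are kept as strings, so any numeric conversion would still have to happen afterwards.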
Upvotes: 1
Reputation: 1555
Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files, and then parse out the relevant tables (again, using low-level string processing). Putting them into a data set is the easiest part; you can either append an observation and fill it in (in l refers to the last observation):
expand 2 in l
replace company = "parsed name" in l
replace revenue = parsed_revenue in l
etc., or use the post mechanics. With some luck, you'd find some packages that may make it simpler, but I am not aware of any, and findit html does not seem to bring up anything usable.
Upvotes: 2