wwl

Reputation: 2065

Scraping data into Stata

I have 40,000 HTML files. Each file has a table containing the profit & loss statement of a particular company.

I would like to scrape all of this data into Stata (or, alternatively, into an Excel/CSV file). The end product should be a Stata/Excel file containing a list of all companies and the details of their profit & loss statements (revenue, profit, etc.).

May I know how this can be done? I tried Outwit but it doesn't seem good enough.

Upvotes: 0

Views: 2235

Answers (3)

Vivek

Reputation: 357

You should use the Python BeautifulSoup package. It is very handy for extracting data from HTML files. Here is the link:

http://www.crummy.com/software/BeautifulSoup/

The documentation lists many commands, but only a few of them are important. These are the ones you need:

    from bs4 import BeautifulSoup

    # read the file (file_name is the path to one HTML file)
    with open(file_name, 'r') as fp:
        data = fp.read()

    # pass the data to BeautifulSoup
    soup = BeautifulSoup(data, 'html.parser')

    # extract the html elements by id and write result into file
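
The last step, extracting the elements and writing the result, could look roughly like the sketch below. It assumes each file holds one table whose rows are made of `th`/`td` cells; the sample HTML, the table id `pl`, the helper name `table_rows`, and the output file name are all made up for illustration, so adjust them to the real files.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of one HTML file.
html_doc = """
<html><body>
<table id="pl">
  <tr><th>Item</th><th>Amount</th></tr>
  <tr><td>Revenue</td><td>1,000</td></tr>
  <tr><td>Profit</td><td>150</td></tr>
</table>
</body></html>
"""

def table_rows(html, table_id=None):
    """Return the rows of the first matching table as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    # Look the table up by id if one is given, otherwise take the first table.
    table = soup.find("table", id=table_id) if table_id else soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

rows = table_rows(html_doc, table_id="pl")

# Write the extracted rows to a CSV file (hypothetical file name).
with open("one_company.csv", "w", newline="", encoding="utf-8") as out:
    csv.writer(out).writerows(rows)
```

If the tables have no id, calling `table_rows(html)` without `table_id` falls back to the first table in the file.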

Upvotes: 1

user872324

Reputation:

Stata is not a good tool for this job, although in principle it is possible. I have done similar things myself: reading ASCII files into Stata, parsing them, and extracting information from them. I read the data into Stata using insheet and then processed it with Stata's string functions. It was a bit cumbersome, even though the files had quite a simple and clear structure. I don't want to imagine what happens when the files have a more complicated structure.

I think the best strategy is to use a scripting language such as Python, Perl, or Ruby to extract the information contained in the HTML tables. The results can easily be written to a CSV, Excel, or even a Stata (.dta) file.
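
As a rough illustration of that strategy in Python, the sketch below walks a directory of HTML files and writes one CSV row per company using only the standard library. It rests on assumptions not stated in the question: every table row is a label cell followed by a value cell, the company name can be taken from the file name, and the files end in `.html`. The names `TableCells`, `company_record`, and `html_dir_to_csv` are hypothetical.

```python
import csv
import glob
import os
from html.parser import HTMLParser

class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

def company_record(path):
    """Turn one HTML file into a dict: company name plus label -> value."""
    parser = TableCells()
    with open(path, encoding="utf-8", errors="replace") as fp:
        parser.feed(fp.read())
    parser.close()
    # Assumption: the file name (without extension) identifies the company.
    record = {"company": os.path.splitext(os.path.basename(path))[0]}
    for row in parser.rows:
        if len(row) >= 2:
            record[row[0]] = row[1]
    return record

def html_dir_to_csv(html_dir, csv_path):
    """Write one CSV row per HTML file found in html_dir."""
    paths = sorted(glob.glob(os.path.join(html_dir, "*.html")))
    records = [company_record(p) for p in paths]
    # Column order: union of all labels, in the order first seen.
    fields = list(dict.fromkeys(key for rec in records for key in rec))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(records)
    return records
```

The resulting CSV can then be read into Stata with `insheet` or `import delimited`. If pandas is available, `DataFrame.to_stata` is one way to write a .dta file directly instead.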

Upvotes: 1

StasK

Reputation: 1555

Stata is not exactly the best tool for the job. You would have to use low-level file commands to read the input text files, and then parse out the relevant tables (again, using low-level string processing). Putting them into a data set is the easiest part; you can either

    expand 2 in l
    replace company = "parsed name" in l
    replace revenue = parsed_revenue in l

etc., or use the post mechanics (postfile, post, postclose). With some luck, you might find a package that makes this simpler, but I am not aware of any, and findit html does not seem to bring up anything usable.

Upvotes: 2
