How to make or get a corpus of financial documents

Question

I am working on a document classification problem for financial reports and documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all built their own corpus.

Skillachie · Accepted Answer

You will more than likely have to create your own corpus. I had a similar task, and creating such a corpus manually would have been too tedious. As a result, I created News Corpus Builder, a Python module that lets you quickly build a corpus around the topics you are interested in.

The module lets you generate your own corpus and store the text and its associated label in SQLite or as flat files.

from news_corpus_builder import NewsCorpusGenerator

# Location to save the generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'

# Save results to sqlite or to one file per article
ex = NewsCorpusGenerator(corpus_dir, 'sqlite')

# Retrieve 50 links related to the search term 'dogs' and
# assign the category 'Pet' to the retrieved links
links = ex.google_news_search('dogs', 'Pet', 50)

# Generate and save the corpus
ex.generate_corpus(links)
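To make the "text plus label in SQLite" idea concrete, here is a minimal sketch that uses only Python's standard `sqlite3` module. The table name, the column names, and the helper functions are all hypothetical and chosen for illustration; the module's own schema may well differ, so inspect the generated database before relying on any of this.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open a database and create a hypothetical 'articles' table.

    This schema is an assumption for illustration, not the layout
    that News Corpus Builder actually writes.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, "
        "category TEXT NOT NULL, "
        "title TEXT, "
        "body TEXT NOT NULL)"
    )
    return conn


def add_article(conn, category, title, body):
    """Store one article together with its label."""
    conn.execute(
        "INSERT INTO articles (category, title, body) VALUES (?, ?, ?)",
        (category, title, body),
    )
    conn.commit()


def load_labeled(conn):
    """Return parallel lists (texts, labels), the usual classifier input."""
    rows = conn.execute(
        "SELECT body, category FROM articles ORDER BY id"
    ).fetchall()
    texts = [body for body, _ in rows]
    labels = [category for _, category in rows]
    return texts, labels


conn = create_store()
add_article(conn, "Pet", "Dog show", "Dogs competed at the annual show.")
add_article(conn, "Oil", "Crude falls", "Oil prices fell for a third day.")
texts, labels = load_labeled(conn)
print(labels)
```

Keeping the label in the same row as the text means the training data for a classifier can be pulled with a single query.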

More details on my blog

The finance corpus is available for download here. The corpus has the following categories:

Policy (licenses, regulation, SEC, monetary, fed, fiscal, IMF)
International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
Economy (GDP, jobs, unemployment, housing, economy)
Raising Capital (IPO, equity)
Real Estate
Mergers & Acquisitions (merger, acquisitions)
Oil (oil, oil prices, natural gas price)
Commodities (commodities, gold, silver)
Fraud (insider trading, ponzi scheme, finance fraud)
Litigation (company litigation, company settlement)
Earning Reports
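If you want to build a similar corpus yourself, one way to organise the work is to keep the categories and their search terms in a dictionary and flatten it into (search term, category) pairs. The sketch below is only an illustration: `CATEGORY_TERMS` and `search_plan` are names made up here, and Real Estate and Earning Reports are left out because no search terms are listed for them above.

```python
# Category -> search terms, copied from the category list above.
# The dictionary and the helper below are illustrative names,
# not part of the news_corpus_builder module.
CATEGORY_TERMS = {
    "Policy": ["licenses", "regulation", "SEC", "monetary", "fed", "fiscal", "imf"],
    "International Finance": [
        "global finance", "IMF", "ECB", "trouble in Greece", "RMB devaluation",
    ],
    "Economy": ["GDP", "jobs", "unemployment", "housing", "economy"],
    "Raising Capital": ["ipo", "equity"],
    "Mergers & Acquisitions": ["merger", "acquisitions"],
    "Oil": ["oil", "oil prices", "natural gas price"],
    "Commodities": ["commodities", "gold", "silver"],
    "Fraud": ["insider trading", "ponzi scheme", "finance fraud"],
    "Litigation": ["company litigation", "company settlement"],
}


def search_plan(category_terms):
    """Flatten the mapping into a list of (search_term, category) pairs."""
    return [
        (term, category)
        for category, terms in category_terms.items()
        for term in terms
    ]


plan = search_plan(CATEGORY_TERMS)
for term, category in plan[:3]:
    print(term, "->", category)

# With the module installed, each pair could then be passed to the
# generator in the same way as the 'dogs' example:
#   links = ex.google_news_search(term, category, 50)
#   ex.generate_corpus(links)
```

Several terms per category should give a broader spread of articles under each label than a single query would.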
