AbtPst

Reputation: 8018

How to make or get a corpus of financial documents

I am working on a document classification problem for financial reports/documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all built their own corpus.

Upvotes: 3

Views: 6178

Answers (2)

Skillachie

Reputation: 3226

You will more than likely have to create your own corpus. I had a similar task, and creating such a corpus manually would have been too tedious. As a result, I created News Corpus Builder, a Python module that lets you quickly build a corpus around the topics you are interested in.

The module lets you generate your own corpus and store each text and its associated label in SQLite or as flat files.

from news_corpus_builder import NewsCorpusGenerator

# Location to save the generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'

# Save results to sqlite, or as one file per article
ex = NewsCorpusGenerator(corpus_dir, 'sqlite')

# Retrieve 50 links for the search term 'dogs' and assign them the category 'Pet'
links = ex.google_news_search('dogs', 'Pet', 50)

# Generate and save the corpus
ex.generate_corpus(links)

More details on my blog

The finance corpus is available for download here. The corpus has the following categories:

  • Policy (licenses, regulation, SEC, monetary, fed, fiscal, IMF)
  • International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
  • Economy (GDP, jobs, unemployment, housing, economy)
  • Raising Capital (IPO, equity)
  • Real Estate
  • Mergers & Acquisitions (merger, acquisitions)
  • Oil (oil, oil prices, natural gas price)
  • Commodities (commodities, gold, silver)
  • Fraud (insider trading, Ponzi scheme, finance fraud)
  • Litigation (company litigation, company settlement)
  • Earning Reports
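
To build a multi-category corpus like the one listed above, one option is to loop over a mapping from category label to search terms. The sketch below is only an illustration: `collect_links` and `category_terms` are hypothetical names, it assumes that `google_news_search` returns an iterable of links (as the snippet above suggests), and it assumes the terms in parentheses were the search terms used for each category.

```python
def collect_links(generator, category_terms, links_per_term=50):
    """Gather labelled links for every search term of every category.

    `generator` is any object exposing google_news_search(term, category, count),
    for example a NewsCorpusGenerator instance (assumed to return an iterable).
    """
    links = []
    for category, terms in category_terms.items():
        for term in terms:
            links.extend(generator.google_news_search(term, category, links_per_term))
    return links


# Hypothetical mapping built from a few of the categories listed above.
category_terms = {
    'Mergers & Acquisitions': ['merger', 'acquisitions'],
    'Oil': ['oil', 'oil prices', 'natural gas price'],
    'Commodities': ['commodities', 'gold', 'silver'],
}

# Usage with the module (not executed here, needs network access):
#   from news_corpus_builder import NewsCorpusGenerator
#   ex = NewsCorpusGenerator('/path/to/finance_corpus', 'sqlite')
#   ex.generate_corpus(collect_links(ex, category_terms))
```

Keeping the category mapping in one dictionary makes it easy to add or rename categories without touching the collection loop.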

Upvotes: 6

Istvan Nagy

Reputation: 310

You can use the Reuters-21578 corpus. http://www.daviddlewis.com/resources/testcollections/reuters21578/

It is a basic corpus for text classification.
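
The distribution ships as SGML files in which each article is a `<REUTERS>` element with its labels in `<TOPICS>` and its text in `<BODY>`. A minimal sketch for turning one such file into (topics, text) pairs, using only the standard library, could look like this. The function name `parse_reuters_sgml` is hypothetical, and the regex approach is a shortcut that assumes the tags are not nested, not a full SGML parser.

```python
import html
import re

# One <REUTERS ...> ... </REUTERS> element per article.
ARTICLE_RE = re.compile(r"<REUTERS\b.*?</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
LABEL_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_reuters_sgml(sgml_text):
    """Return a list of (topics, body) pairs, skipping articles without a body."""
    documents = []
    for article in ARTICLE_RE.findall(sgml_text):
        body_match = BODY_RE.search(article)
        if body_match is None:
            continue
        topics_match = TOPICS_RE.search(article)
        topics = LABEL_RE.findall(topics_match.group(1)) if topics_match else []
        # Decode entities such as &lt; and &amp; in the article text.
        body = html.unescape(body_match.group(1)).strip()
        documents.append((topics, body))
    return documents
```

When reading the files from disk, opening them with `encoding="latin-1"` avoids decode errors, since that codec accepts every byte value.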

Upvotes: 3
