Reputation: 8018
I am working on a document classification problem for financial reports and documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all built their own corpus.
Upvotes: 3
Views: 6178
Reputation: 3226
You will more than likely have to create your own corpus. I had a similar task, and creating such a corpus manually was too tedious. As a result, I created News Corpus Builder, a Python module that lets you quickly build a corpus around the topics you are interested in.
The module generates the corpus and stores each article's text and its associated label either in SQLite or as flat files.
from news_corpus_builder import NewsCorpusGenerator
# Location to save generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'
# Save results to sqlite or files per article
ex = NewsCorpusGenerator(corpus_dir, 'sqlite')
# Retrieve 50 links related to the search term 'dogs' and assign the category 'Pet' to them
links = ex.google_news_search('dogs', 'Pet', 50)
# Generate and save corpus
ex.generate_corpus(links)
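To illustrate what storing text with an associated label in SQLite can look like, here is a minimal sketch using only Python's standard `sqlite3` module. The table name `articles`, its columns, the helper names `save_articles` and `load_labeled_texts`, and the sample rows are all assumptions made for this example; they are not taken from News Corpus Builder, whose actual schema may differ.

```python
import sqlite3


def save_articles(conn, articles):
    """Insert (category, title, body) tuples into a hypothetical 'articles' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, "
        "category TEXT NOT NULL, "
        "title TEXT, "
        "body TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO articles (category, title, body) VALUES (?, ?, ?)",
        articles,
    )
    conn.commit()


def load_labeled_texts(conn):
    """Return (body, category) pairs in insertion order, ready for a classifier."""
    rows = conn.execute("SELECT body, category FROM articles ORDER BY id")
    return rows.fetchall()


# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
save_articles(conn, [
    ("Earnings", "Q3 results", "Revenue rose 12 percent year over year."),
    ("Pet", "Dog show", "A terrier won best in show."),
])
pairs = load_labeled_texts(conn)
```

Keeping the label in the same row as the text means a single query yields the training pairs for whatever classifier you use afterwards.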
More details on my blog
The finance corpus is available for download here. The corpus has the following categories:
Upvotes: 6
Reputation: 310
You can use the Reuters-21578 corpus. http://www.daviddlewis.com/resources/testcollections/reuters21578/
It is a basic corpus for text classification.
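As far as I know, the collection is distributed as SGML files in which each article sits inside a `<REUTERS>` element, with its labels in `<TOPICS><D>...</D></TOPICS>` and its text in `<BODY>`. Under that assumption, a rough sketch for pulling out (topics, body) pairs could look like the following. The function name `parse_reuters_sgml` and the sample record are made up for illustration, and regex parsing is a shortcut here rather than a full SGML parser, so check the tag layout against the README shipped with the collection.

```python
import re

# Patterns for the tags assumed above; DOTALL lets '.' span line breaks.
RECORD_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
LABEL_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_reuters_sgml(text):
    """Return a list of (topics, body) tuples, skipping records without a body."""
    docs = []
    for record in RECORD_RE.finditer(text):
        content = record.group(1)
        body_match = BODY_RE.search(content)
        if body_match is None:
            continue  # some records carry no body text
        topics_match = TOPICS_RE.search(content)
        topics = LABEL_RE.findall(topics_match.group(1)) if topics_match else []
        docs.append((topics, body_match.group(1).strip()))
    return docs


# Invented sample in the assumed layout, not a real record from the collection.
sample = """<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>earn</D></TOPICS>
<TEXT>
<TITLE>EXAMPLE CORP REPORTS PROFIT</TITLE>
<BODY>Example Corp said net income rose in the first quarter.</BODY>
</TEXT>
</REUTERS>
<REUTERS TOPICS="NO" NEWID="2">
<TOPICS></TOPICS>
<TEXT><TITLE>NO BODY HERE</TITLE></TEXT>
</REUTERS>"""

docs = parse_reuters_sgml(sample)
```

Each article can carry zero, one, or several topics, so for a single-label classifier you would still need to decide how to handle the multi-topic records.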
Upvotes: 3