Aikin

Reputation: 319

Building your own text corpus

It may sound stupid, but do you know how to build a text corpus? I have searched everywhere and found plenty of existing corpora, but I wonder how they were built. For example, if I want to build a corpus of positive and negative tweets, do I just make two files? And what goes inside those files? I don't get it. In this example he stores pos and neg tweets in a Redis DB.

Upvotes: 3

Views: 2735

Answers (1)

Curtis

Reputation: 556

But what about inner of those files?

This depends mostly on what library you're using. XML (with a variety of tags) is common, as is one sentence per line. The tricky part is getting the data in the first place.
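To make the two layouts concrete, here is a minimal sketch using only the Python standard library. The file names (pos.txt, neg.txt), the helper functions, the XML tag names, and the sample tweets are all made up for illustration; whatever library you end up using may expect a different layout, so check its documentation first.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_line_corpus(directory, tweets_by_label):
    """Write one file per label (e.g. pos.txt, neg.txt), one tweet per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for label, tweets in tweets_by_label.items():
        # Collapse internal whitespace so each tweet stays on a single line.
        lines = [" ".join(t.split()) for t in tweets]
        path = directory / f"{label}.txt"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_line_corpus(directory):
    """Return (text, label) pairs; the label is taken from the file name."""
    pairs = []
    for path in sorted(Path(directory).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                pairs.append((line, path.stem))
    return pairs


def to_xml(pairs):
    """Build the same corpus as one XML document with a label attribute."""
    root = ET.Element("corpus")
    for text, label in pairs:
        ET.SubElement(root, "tweet", label=label).text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, standing in for tweets you have collected.
sample = {
    "pos": ["I love this phone", "What a great\nday"],
    "neg": ["Worst service ever"],
}

with tempfile.TemporaryDirectory() as tmp:
    write_line_corpus(tmp, sample)
    corpus = read_line_corpus(tmp)

print(corpus)
print(to_xml(corpus))
```

So yes, two files is a perfectly reasonable start: the file name carries the label and each line is one tweet. The XML variant keeps everything in one document and leaves room for extra attributes (author, date, and so on) if you need them later.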

For example, if I want to build corpus with positive and negative tweets

Does this mean that you want to know how to mark the tweets as positive and negative? If so, what you're looking for is called text classification or, for positive/negative labels specifically, sentiment analysis.

If you want to find a bunch of tweets, I'd check one of these pages (just from a quick search of my own).

Clickonf5: http://clickonf5.org/5438/download-tweets-pdf-xml-format-local-machine-server/

Quora: http://quora.com/What-is-the-best-tool-to-download-and-archive-Twitter-data-of-certain-hashtags-and-mentions-for-academic-research

Google Groups: http://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/kfislDfxunI

For general learning about how to create a corpus, I would check out the Handbook of Natural Language Processing Wiki by Richard Xiao.

Upvotes: 5
