Cisplatin

Reputation: 2998

Is there a resource with lots of human-written text?

I just coded a Markov chain that generates text based on the data it has learned from. I'd like an online source with a lot of text data, but I can't seem to find one (most sites, like Wikipedia, have a lot of junk rather than plain text files).

Is there a site with a lot of plain text files suitable for testing a Markov chain?

Upvotes: 0

Views: 44

Answers (4)

Ewan Mellor

Reputation: 6847

Consider the Enron Email Dataset: https://www.cs.cmu.edu/~./enron/

It is also hosted on Amazon AWS: https://aws.amazon.com/datasets/enron-email-data/

Upvotes: 0

mock_blatt

Reputation: 965

gutenberg.org might have some resources for you. For example, here's what appears to be the text of Moby Dick as a plain text file:

http://www.gutenberg.org/files/2701/2701.txt
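A rough sketch of how a file like that could be fed to a Markov chain, assuming it has been saved locally (the filename `2701.txt` in the usage comment is only an example). The helper names here are made up for illustration, and the `*** START OF` / `*** END OF` marker lines are an assumption about how Project Gutenberg wraps its license text; if a file lacks them, the text is used unchanged.

```python
import random
from collections import defaultdict

# Assumed marker prefixes around the book body; not every file may use them.
START_MARK = "*** START OF"
END_MARK = "*** END OF"


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, or the
    original text unchanged if both markers cannot be found in order."""
    lines = text.splitlines()
    start = next((i for i, ln in enumerate(lines) if ln.startswith(START_MARK)), None)
    end = next((i for i, ln in enumerate(lines) if ln.startswith(END_MARK)), None)
    if start is None or end is None or end <= start:
        return text
    return "\n".join(lines[start + 1:end])


def load_words(path):
    """Read a plain text file and split it into whitespace-separated words."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return strip_gutenberg_boilerplate(f.read()).split()


def build_chain(words, order=1):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Walk the chain from a random state, stopping early at a dead end."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)


# Example usage (assumes the file was downloaded to the current directory):
#   words = load_words("2701.txt")
#   print(generate(build_chain(words, order=2), length=30))
```

If you already have your own chain, only `load_words` is of interest; the rest is there so the snippet runs on its own.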

Upvotes: 2

cytsunny

Reputation: 5030

If your concern is just removing the tags from Wikipedia, how about using a source like this one, which removes the tags for you?

http://kopiwiki.dsd.sztaki.hu/

Upvotes: 1

Warden

Reputation: 106

Have you tried the NLTK text corpora?
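A minimal sketch of pulling words from one of them, assuming `nltk` is installed and the machine can download the `gutenberg` corpus on first use. The function names are made up for illustration, and lowercasing and dropping punctuation is just one possible choice of preprocessing.

```python
def clean_tokens(tokens):
    """Lowercase tokens and keep only purely alphabetic ones."""
    return [t.lower() for t in tokens if t.isalpha()]


def load_corpus_words(fileid="melville-moby_dick.txt"):
    """Return cleaned words from a file in NLTK's Gutenberg corpus.

    nltk is imported lazily so that clean_tokens can be used without it.
    """
    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg", quiet=True)  # fetches the corpus if missing
    return clean_tokens(gutenberg.words(fileid))


# Example usage:
#   words = load_corpus_words()
#   print(len(words), words[:10])
```

The resulting word list can be passed straight to whatever builds your transition table.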

Upvotes: 0
