Marc van der Peet

Reputation: 343

Transforming list of documents into corpus

I'm following this tutorial on text mining: http://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/

Everything is quite clear, but there is one thing I do not get:

At a certain moment the list of documents is converted into a corpus:

doc.vec <- VectorSource(shakespeare)
doc.corpus <- Corpus(doc.vec)

Could anybody explain to me in plain English (preferably with an example) what's happening under the hood here?

Upvotes: 1

Views: 5170

Answers (2)

scoa

Reputation: 19857

I am guessing that the trouble comes from the VectorSource part of the code: why do we need this extra step to create a corpus?

Corpora are R objects that hold text and metadata. They are created by the function tm::Corpus, which transforms a collection of texts into a well-formatted object that other text mining functions are able to understand.

However, documents can come in many different forms. Let's consider two of them.

  • The documents are a bunch of text files on your computer, each holding one document.
  • The documents are stored in a character vector in R, each element being a document.

The function Corpus is not able to differentiate those two sources by itself. This is where the various Source functions come in. They preformat the documents according to the kind of source, so that Corpus is able to read them.

If, for instance, you had a directory named shakespeare on your computer, with one text file for each play (midsummer.txt, hamlet.txt, etc.), you would create your corpus like this:

corpus <- Corpus(DirSource(directory="/path/to/shakespeare"))

This would read the files one by one and add them as documents to the corpus.
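As a rough, self-contained way to try this out, the sketch below builds a throwaway directory with two tiny files and reads it through DirSource. The directory name shakespeare_demo and the file contents are made up for illustration, and the tm package is assumed to be installed.

```r
library(tm)

# Hypothetical stand-in for a folder of plays: two small files in a temp directory
demo_dir <- file.path(tempdir(), "shakespeare_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("Love looks not with the eyes...", file.path(demo_dir, "midsummer.txt"))
writeLines("to be or not to be...", file.path(demo_dir, "hamlet.txt"))

# DirSource lists the files; Corpus then reads each file as one document
dir_corpus <- Corpus(DirSource(directory = demo_dir))

# Number of documents in the corpus (one per file)
length(dir_corpus)
```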

Suppose instead, as is the case in your tutorial, that those documents have already been read into R (through readLines, for instance) and stored in a data.frame:

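# stringsAsFactors = FALSE keeps the text column as character rather than factor
shakespeare <- data.frame(title=c("midsummer","hamlet"),
                          text=c("Love looks not with the eyes...","to be or not to be..."),
                          stringsAsFactors=FALSE)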

Then you would use VectorSource instead, passing it the column that holds the texts:

corpus <- Corpus(VectorSource(shakespeare$text))
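To see what ends up inside the corpus, you can inspect it directly. This is only a minimal sketch, assuming the tm package is installed; the two-element vector texts is a made-up stand-in for the tutorial's data.

```r
library(tm)

# Hypothetical stand-in for the tutorial's data: one string per document
texts <- c("Love looks not with the eyes...", "to be or not to be...")

# VectorSource treats each element of the vector as one document
vec_corpus <- Corpus(VectorSource(texts))

# One corpus entry per element of the vector
length(vec_corpus)

# The text of the first document
first_doc <- content(vec_corpus[[1]])
first_doc
```

Each element of the vector becomes its own document, which is why the number of documents matches the length of the vector.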

For more information, read ?Source and ?Corpus. You will see that there are other possible sources, but I personally never use them.

Upvotes: 2

Dr VComas

Reputation: 735

The Corpus is the main structure of the tm package. Think of it as the data structure tm uses to hold your list of documents. Later on, all your text mining analytics and insights will result from transformations applied to your Corpus. If you want to understand it in more detail, I would suggest reading:

https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf Hope it helps!
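As a small illustration of what "transformations done to your Corpus" can look like, here is a sketch using tm_map. It assumes the tm package is installed, and the texts vector is invented for the example rather than taken from the tutorial.

```r
library(tm)

# Hypothetical example documents
texts <- c("Love looks not with the eyes...", "To be or not to be...")
map_corpus <- Corpus(VectorSource(texts))

# tm_map applies a transformation to every document in the corpus
map_corpus <- tm_map(map_corpus, content_transformer(tolower))
map_corpus <- tm_map(map_corpus, removePunctuation)

# Text of the second document after both transformations
second_doc <- content(map_corpus[[2]])
second_doc
```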

Upvotes: 0
