Reputation: 343
I'm doing the following tutorial on text mining: http://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/
Everything is quite clear, but there is one thing I do not get:
At a certain moment the list of documents is converted into a corpus:
doc.vec <- VectorSource(shakespeare)
doc.corpus <- Corpus(doc.vec)
Could anybody explain to me in plain English (preferably with an example) what's happening under the hood here?
Upvotes: 1
Views: 5170
Reputation: 19857
I am guessing that the trouble comes from the VectorSource part of the code: why do we need this extra step to create a corpus?
Corpuses are R objects that hold text and metadata. They are created by the function tm::Corpus. It basically transforms a collection of texts into a well-formatted object that other text mining functions are able to understand.
However, documents can come in many different forms. Let's consider two of them.
The function Corpus is not able to differentiate those two sources by itself. This is where the various Source functions come in. They preformat the documents according to the kind of source, so that Corpus is able to understand them.
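To make that concrete, here is a minimal sketch of the two steps from the question, assuming the tm package is installed. The `texts` vector is a made-up stand-in for the `shakespeare` object in the tutorial; the point is only that VectorSource treats each element of the vector as one document, and Corpus then stores them.

```r
library(tm)

# Hypothetical stand-in for the tutorial's `shakespeare` object:
# a character vector with one element per document
texts <- c("Love looks not with the eyes...", "to be or not to be...")

# Step 1: describe where the documents live (here: the elements of a vector)
doc.vec <- VectorSource(texts)

# Step 2: let Corpus read the documents from that source
doc.corpus <- Corpus(doc.vec)

length(doc.corpus)          # number of documents in the corpus
content(doc.corpus[[2]])    # raw text of the second document
```

So the source object does not hold the "analysis-ready" texts itself; it only tells Corpus how to fetch the documents one by one.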
If, for instance, what you had was a directory named shakespeare on your computer, with one text file for each play (midsummer.txt, hamlet.txt, etc.), you would create your corpus like this:
corpus <- Corpus(DirSource(directory="/path/to/shakespeare"))
This would read the files one by one and add them as documents to the corpus.
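If you want to try the directory case without having the plays on disk, the following sketch builds a throwaway directory first. The temporary directory and the two one-line files are invented purely for illustration; only DirSource and Corpus come from tm.

```r
library(tm)

# Create a temporary directory with two tiny "plays" (illustrative data only)
play_dir <- tempfile("shakespeare_")
dir.create(play_dir)
writeLines("Love looks not with the eyes...", file.path(play_dir, "midsummer.txt"))
writeLines("to be or not to be...", file.path(play_dir, "hamlet.txt"))

# DirSource treats every file in the directory as one document
dir_corpus <- Corpus(DirSource(directory = play_dir))

length(dir_corpus)   # one document per file
```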
Suppose instead that, as is the case in your tutorial, those documents had already been read into R (through readLines, for instance) and were made into a data.frame:
shakespeare <- data.frame(title=c("midsummer","hamlet"),
text=c("Love looks not with the eyes...","to be or not to be..."),
stringsAsFactors=FALSE)  # keep text as character, not factor (matters in R < 4.0)
In that case you would have to adjust and use VectorSource on the text column:
corpus <- Corpus(VectorSource(shakespeare$text))
For more information, read ?Source and ?Corpus. You will see that there are other possible sources, but I personally never use them.
Upvotes: 2
Reputation: 735
The Corpus is the main structure of the tm package. Think of it as the data structure that tm uses for your list of documents. Later on, all your text mining analytics and insights are going to result from transformations done to your Corpus. If you want to understand it in more detail, I would suggest reading:
https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf Hope it helps!
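As a rough illustration of "transformations done to your Corpus", here is a small sketch, assuming the tm package is installed. The two example strings are made up, and I use VCorpus rather than Corpus here only as a personal choice for the demo; tm_map, content_transformer, removePunctuation and DocumentTermMatrix are all tm functions.

```r
library(tm)

# Illustrative two-document corpus (made-up text)
demo_corpus <- VCorpus(VectorSource(c("Love looks not with the eyes...",
                                      "to be or not to be...")))

# Each tm_map call applies one transformation to every document
demo_corpus <- tm_map(demo_corpus, content_transformer(tolower))
demo_corpus <- tm_map(demo_corpus, removePunctuation)

content(demo_corpus[[1]])   # cleaned text of the first document

# The cleaned corpus can then be turned into a document-term matrix
dtm <- DocumentTermMatrix(demo_corpus)
nDocs(dtm)                  # one row per document
```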
Upvotes: 0