Reputation: 179
I have about 30k lines of text, averaging 50-60 characters each. When I plot a term-document matrix, the plot seems to work better (from a correlation perspective) when there are a few lines with a lot of text rather than many lines with little text.
For example, when I plot a TDM of Pride and Prejudice, the graph shows stronger correlations between nodes when the whole text is on one line than when each line is treated as a separate document in the corpus.
With the following code:
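library("tm")
# Read the clipboard contents; each line of text becomes one row of dd
dd <- read.table("clipboard", sep = "\r", quote = "", stringsAsFactors = FALSE)
# Each element of dd$V1 (i.e. each line) becomes its own document
feedback <- Corpus(VectorSource(dd$V1))
tdm2 <- TermDocumentMatrix(feedback, control = list(removePunctuation = TRUE,
                                                    removeNumbers = TRUE,
                                                    stopwords = TRUE))
##################################################################
corT = 0.1   # correlation threshold for drawing an edge
freq = 75    # keep terms that occur at least this many times
freqterms <- findFreqTerms(tdm2, lowfreq = freq)#[1:29]
# For each term, count the other terms correlated above corT (minus 1 for itself)
vtxcnt <- rowSums(cor(as.matrix(t(tdm2[freqterms,]))) > corT) - 1
mycols <- c("#f7fbff", "#deebf7", "#c6dbef",
            "#9ecae1", "#6baed6", "#4292c6",
            "#2171b5", "#084594")
# Cap the index at the darkest color so large counts do not produce NA
vc <- mycols[pmin(vtxcnt + 1, length(mycols))]
names(vc) <- names(vtxcnt)
##################################################################
plot(tdm2,
     terms = freqterms,
     #weighting = TRUE,
     corThreshold = corT,
     nodeAttrs = list(fillcolor = vc))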
I produce the following plot if the text is taken as-is from Gutenberg.org:
This is with a 0.1 correlation threshold and using the terms that occur at least 75 times. Not very interesting. If I instead take the entire book as a single line and re-run the code with corT=0.9 and freq=175, then we get:
This seems a lot more informative. Is there a way to pull text in, via the clipboard or otherwise, so that each line is not treated as its own 'book' (document) in the corpus? Does the Corpus() function work only on a vector source, or could I use something like readLines() to bring the clipboard data in as a single document? So far I have been taking a text document and merging the lines manually, from a few thousand down to a few dozen, but I feel like there has to be a better solution.
Upvotes: 0
Views: 107
Reputation: 655
I can't speak for the efficiency of this solution, but it works for me:
feedback <- Corpus(VectorSource(concat(dd$V1,collapse=" ")))
UPDATE: I forgot to mention that concat() is from the ngram package
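If you would rather not depend on ngram, base R's paste() with collapse does the same job. Here is a minimal sketch, not tested on your data: the sample vector `lines` is a made-up stand-in for dd$V1 (or for the character vector that readLines() returns), and `chunk_size` is a hypothetical parameter for merging every n lines into one document instead of collapsing everything into one.

```r
# Made-up sample lines; in the question these would be dd$V1
# (or the result of readLines() on your source).
lines <- c("alpha beta", "gamma delta", "epsilon zeta",
           "eta theta", "iota kappa")

# Option 1: collapse every line into a single document
one_doc <- paste(lines, collapse = " ")

# Option 2: merge every `chunk_size` lines into one document
# (chunk_size is an illustrative value, tune it to your data)
chunk_size <- 2
groups <- ceiling(seq_along(lines) / chunk_size)
chunks <- unname(vapply(split(lines, groups), paste,
                        character(1), collapse = " "))

# Build the corpus only if tm is installed
if (requireNamespace("tm", quietly = TRUE)) {
  feedback <- tm::Corpus(tm::VectorSource(chunks))
  print(length(feedback))  # number of documents in the corpus
}
```

With a single document, correlations between terms across documents cannot be computed meaningfully, so the chunked version (a few dozen documents) is probably closer to what you were doing by hand.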
Upvotes: 0