AI52487963

Reputation: 179

Corpus from clipboard: many lines as one document?

I have about 30k lines of text, averaging 50-60 characters each. When I plot a term-document matrix, the plot seems to work better (from a correlation perspective) when there are a few lines with a lot of text rather than many lines with little text.

For example, if I plot a TDM of Pride and Prejudice, the nodes in the graph seem to be better correlated when the text is all on one line than when each line is a separate document in the corpus.

With the following code:

library("tm")

dd <- read.table("clipboard", sep="\r", quote="")
feedback <- Corpus(VectorSource(dd$V1))

tdm2 <- TermDocumentMatrix(feedback, control = list(removePunctuation = TRUE,
                                                    removeNumbers = TRUE,
                                                    stopwords = TRUE))



##################################################################
corT = 0.1
freq = 75

freqterms <- findFreqTerms(tdm2, lowfreq = freq)#[1:29]

vtxcnt <- rowSums(cor(as.matrix(t(tdm2[freqterms,])))>corT)-1

mycols<-c("#f7fbff","#deebf7","#c6dbef",
          "#9ecae1","#6baed6","#4292c6",
          "#2171b5", "#084594")

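# Clamp the index so a term with more neighbours than colours does not get NA
vc <- mycols[pmin(vtxcnt + 1, length(mycols))]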
names(vc) <- names(vtxcnt)
##################################################################

plot(tdm2, 
     terms = freqterms, 
     #weighting = TRUE,
     corThreshold = corT,
     nodeAttrs=list(fillcolor=vc))

I produce the following plot if the text is taken as-is from Gutenberg.org:

[image: plot of the term-document matrix, one document per line]

This is with a correlation threshold of 0.1 and the terms that occur at least 75 times (freq is passed as lowfreq to findFreqTerms()). Not very interesting. If I instead put the entire book on a single line and re-run the code with corT=0.9 and freq=175, I get:

[image: plot of the term-document matrix, entire book on one line]

Which seems a lot more informative. Is there a way to pull text in, via the clipboard or otherwise, so that the corpus doesn't treat each line as its own 'book'? Does the Corpus() function work only on a vector source, or could I use something like readLines() to bring the data in from the clipboard as a single document? So far I have been taking a text document and manually merging the lines from a few thousand down to a few dozen, but I feel like there has to be a better solution.

Upvotes: 0

Views: 107

Answers (1)

Vincent Rupp

Reputation: 655

I can't speak for the efficiency of this solution, but it works for me:

feedback <- Corpus(VectorSource(concat(dd$V1,collapse=" ")))

UPDATE: I forgot to mention that concat() is from the ngram package.
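If you'd rather not depend on an extra package, base R's paste() with collapse should do the same job. Here is a minimal sketch, assuming the lines are already in a character vector. The sample lines, the variable names, and chunk_size are made up for illustration; in your case the vector would be dd$V1 or whatever readLines() returns.

```r
# Stand-in for the lines read from the clipboard (hypothetical sample data).
lines <- c("It is a truth universally acknowledged,",
           "that a single man in possession of a good fortune,",
           "must be in want of a wife.",
           "However little known the feelings or views of such a man may be")

# Option 1: collapse everything into one string, i.e. a single document.
one_doc <- paste(lines, collapse = " ")

# Option 2: merge every chunk_size consecutive lines into one document.
# chunk_size is a hypothetical tuning knob, not something from the question.
chunk_size <- 2
group <- ceiling(seq_along(lines) / chunk_size)
chunks <- unname(vapply(split(lines, group), paste, character(1), collapse = " "))

# Either character vector can be passed to tm the same way as in the question.
# Guarded so the sketch still runs when tm is not installed.
if (requireNamespace("tm", quietly = TRUE)) {
  feedback <- tm::Corpus(tm::VectorSource(chunks))
}
```

Option 2 automates what you were doing by hand. Since the correlations in the plot are computed across documents, a few dozen larger chunks may be worth trying before collapsing everything into one giant string.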

Upvotes: 0
