Reputation: 15
I am working on a text mining assignment and am stuck. The code below is based on Zhao's Text Mining
with Twitter. I cannot get it to work; maybe one of you has a good idea?
Goal: I would like to remove all terms from the corpus with a word count of one instead of using a stopword list.
What I did so far: I have downloaded the tweets and converted them into a data frame.
tf1 <- Corpus(VectorSource(tweets.df$text))
tf1 <- tm_map(tf1, content_transformer(tolower))
removeUser <- function(x) gsub("@[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeUser))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeNumPunct))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeURL))
tf1 <- tm_map(tf1, stripWhitespace)
# Using a TermDocumentMatrix to find terms with a count of 1; I don't know any other way
tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))
ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)
tf1Copy <- tf1
tf1List <- setdiff(tf1Copy, ones)
tf1CList <- paste(unlist(tf1List),sep="", collapse=" ")
tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)
tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))
#Just to test success...
ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)
(ones2)
The Error:
Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(senior data scientist global strategy firm
25.0010230541229 48 17 6 6 115 1 186 0 1 en kdnuggets poll primary programming language for analytics data mining data scienc
25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet of everything official statistics in the information age anu june
25.0020229816437 48 17 6 6 115 1 186 0 3 en handling and processing strings in r an ebook in pdf format pages
25.0020229816437 48 17 6 6 115 1 186 0 4 en webinar getting your data into r by hadley wickham am edt june th
25.0020229816437 48 17 6 6 115 1 186 0 5 en before loading the rdmtweets dataset please run librarytwitter to load required package
25.0020229816437 48 17 6 6 115 1 186 0 6 en an infographic on sas vs r vs python datascience via
25.0020229816437 48 17 6 6 115 1 186 0 7 en r is again the kdnuggets poll on top analytics data mining science software
25.0020229816437 48 17 6 6 115 1 186 0 8 en i will run
In Addition:
Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''
PS: Sorry for the bad formatting at the end; I could not get it fixed.
Upvotes: 0
Views: 1371
Reputation: 14902
Here's a simpler method using the dfm() and trim() functions from the quanteda package:
require(quanteda)
mydfm <- dfm(c("This is a doc", "This is another doc"), verbose = FALSE)
mydfm
## Document-feature matrix of: 2 documents, 5 features.
## 2 x 5 sparse Matrix of class "dfmSparse"
## features
## docs a another doc is this
## text1 1 0 1 1 1
## text2 0 1 1 1 1
trim(mydfm, minCount = 2)
## Features occurring less than 2 times: 2
## Document-feature matrix of: 2 documents, 3 features.
## 2 x 3 sparse Matrix of class "dfmSparse"
## features
## docs doc is this
## text1 1 1 1
## text2 1 1 1
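For anyone on a more recent quanteda release: as far as I know, trim() has since been renamed dfm_trim(), the minCount argument has become min_termfreq, and dfm() expects a tokens object rather than a raw character vector. A minimal sketch of the same idea under those assumptions (check the names against your installed version):

```r
library(quanteda)

# Same two toy documents as above
txt <- c("This is a doc", "This is another doc")

# Assumption: recent quanteda versions want tokens() before dfm()
toks <- tokens(txt)
mydfm <- dfm(toks)

# Keep only features that occur at least twice across all documents
trimmed <- dfm_trim(mydfm, min_termfreq = 2)
featnames(trimmed)
```

Note that this trims the document-feature matrix, not the underlying texts, just like the original example.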
Upvotes: 0
Reputation: 54237
Here's one way to remove all terms with a word count of one from the corpus:
library(tm)
mytweets <- c("This is a doc", "This is another doc")
corp <- Corpus(VectorSource(mytweets))
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is another doc
## ^^^
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
# Terms
# Docs another doc this
# 1 0 1 1
# 2 1 1 1
(stopwords <- findFreqTerms(dtm, 1, 1))
# [1] "another"
corp <- tm_map(corp, removeWords, stopwords)
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
#
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is doc
## ^ 'another' is gone
(As a side note: the tokens 'a' and 'is' are missing from the matrix as well, because DocumentTermMatrix drops tokens shorter than 3 characters by default. Pass control = list(wordLengths = c(1, Inf)), as in the question, to keep them.)
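On a real tweet corpus the vector of once-only terms can get long, and handing it to removeWords in one go may run into the "regular expression is too large" warning shown in the question, since removeWords pastes all words into a single pattern. A minimal sketch of one workaround, removing the words in batches: remove_in_chunks is a hypothetical helper (not part of tm), and the default chunk size of 500 is an assumption you may need to lower.

```r
library(tm)

# Hypothetical helper: apply removeWords to the corpus in batches so that
# each generated regular expression stays reasonably small.
remove_in_chunks <- function(corpus, words, chunk_size = 500) {
  # Split the word vector into groups of at most chunk_size elements
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    corpus <- tm_map(corpus, removeWords, chunk)
  }
  corpus
}

docs <- c("this is a doc", "this is another doc")
corp <- VCorpus(VectorSource(docs))

# Count every term, including short ones, as in the question
tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(1, Inf)))
ones <- findFreqTerms(tdm, lowfreq = 1, highfreq = 1)

# chunk_size = 1 only to exercise the batching on this tiny example
cleaned <- remove_in_chunks(corp, ones, chunk_size = 1)
cleaned <- tm_map(cleaned, stripWhitespace)
lapply(cleaned, content)
```

The key difference from the code in the question is that the vector of terms (ones) goes to removeWords directly, instead of the corpus text collapsed into one long string.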
Upvotes: 1