Reputation: 1631
I am using the tm and wordcloud packages for some basic text mining in R. The text being processed contains many meaningless words like asfdg and aawptkr, and I need to filter out such words.
The closest solution I have found is using library(qdapDictionaries) and building a custom function to check the validity of words:
library(qdapDictionaries)
is.word <- function(x) x %in% GradyAugmented
# example
> is.word("aapg")
[1] FALSE
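Since %in% is vectorised, the same check also works on whole character vectors, not just single words:

```r
library(qdapDictionaries)

is.word <- function(x) x %in% GradyAugmented
is.word(c("dog", "asfdg", "cat"))  # returns a logical vector of TRUE/FALSE flags
```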
The rest of the text-mining code is:
library(tm)

curDir <- "E:/folder1/" # folder1 contains a.txt, b.txt
myCorpus <- VCorpus(DirSource(curDir))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, foo) # foo should clear meaningless words from the corpus
The issue is that is.word() works fine on data frames, but how can I apply it to a corpus?
Thanks
Upvotes: 3
Views: 4997
Reputation: 14902
If you are willing to try a different text-mining package, then this will work:
library(readtext)
library(quanteda)
myCorpus <- corpus(readtext("E:/folder1/*.txt"))
# tokenize the corpus
myTokens <- tokens(myCorpus, remove_punct = TRUE, remove_numbers = TRUE)
# keep only the tokens found in an English dictionary
myTokens <- tokens_select(myTokens, names(data_int_syllables))
From there you can form a document-term matrix (called a "dfm" in quanteda) for analysis, and it will contain only the features matched as English-language terms in the dictionary (which contains about 130,000 words).
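The last step might look like this (a minimal sketch, assuming the myTokens object from the snippet above and quanteda's standard dfm() constructor):

```r
library(quanteda)

# Build a document-feature matrix from the dictionary-filtered tokens
myDfm <- dfm(myTokens)

# Inspect the most frequent English-language terms that survived the filter
topfeatures(myDfm, 10)
```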
Upvotes: 6
Reputation: 47320
Not sure if it will be the most resource-efficient method (I don't know the package very well), but it should work:
library(qdapDictionaries) # for GradyAugmented

tdm <- TermDocumentMatrix(myCorpus)
all_tokens <- findFreqTerms(tdm, 1)
tokens_to_remove <- setdiff(all_tokens, GradyAugmented)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords),
                   tokens_to_remove)
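One caveat: removeWords joins its word list into a single regular expression, so a very long tokens_to_remove vector can exceed the regex size limit. A workaround (a sketch only, with an arbitrarily chosen chunk size) is to remove the words in batches:

```r
# Remove the unwanted words in chunks to avoid building one huge regex
chunk_size <- 1000
chunks <- split(tokens_to_remove,
                ceiling(seq_along(tokens_to_remove) / chunk_size))
for (ch in chunks) {
  myCorpus <- tm_map(myCorpus, content_transformer(removeWords), ch)
}
```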
Upvotes: 2