Reputation: 195
I am having trouble with stem completion on a corpus I built with the tm package.
Here are the most important lines of my code:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages, or more specifically any character that is not a letter (a-z, A-Z), a digit (0-9) or whitespace
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now, to complete the stems, I apply stemCompletion from the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
However, this is where my corpus gets mangled: stemCompletion does not work properly. Oddly, R does not report an error. The code runs, but the result is unusable.
Does anybody know a solution for this? By the way, my "comments_final" data frame consists of YouTube comments, which I downloaded using the tubeR package.
Thank you in advance for any help. I really need this for my master's thesis.
Upvotes: 1
Views: 865
Reputation: 1
I am new to supervised methods. Here is how I normalize my data:
corpuscleaned1 <- tm_map(AI_corpus, removePunctuation) ## Remove punctuation.
corpuscleaned2 <- tm_map(corpuscleaned1, stripWhitespace) ## Remove extra whitespace.
corpuscleaned3 <- tm_map(corpuscleaned2, removeNumbers) ## Remove numbers.
corpuscleaned4 <- tm_map(corpuscleaned3, stemDocument, language = "english") ## Stem words.
corpuscleaned5 <- tm_map(corpuscleaned4, removeWords, stopwords("en")) ## Remove stopwords.
head(AI_corpus[[1]]$content) ## Examine original txt.
head(corpuscleaned5[[1]]$content) ## Examine clean txt.
Here, AI_corpus is my corpus of Amnesty International reports from 1993-2013.
Upvotes: 0
Reputation: 1066
stemCompletion does seem to behave a bit oddly, so I wrote my own stem-completion function and applied it to the corpus. In your case, try this:
stemCompletion2 <- function(x, dictionary) {
  # Split the document into individual words
  x <- unlist(strsplit(as.character(x), " "))
  # Oddly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid this.
  x <- x[x != ""]
  # Complete each stem against the dictionary
  x <- stemCompletion(x, dictionary = dictionary)
  # Rejoin the words into one string and wrap it as a document
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
# Apply to every document, then turn the resulting list back into a corpus
corpus <- lapply(corpus, stemCompletion2, dictionary = corpus_copy)
corpus <- as.VCorpus(corpus)
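To illustrate why the split into words matters, here is a minimal base-R sketch. It assumes that stemCompletion treats each element of x as a prefix to look up among the dictionary words, which is my understanding of its behaviour rather than something confirmed here. complete_first is a hypothetical stand-in written only for this illustration (it is not part of tm), and the three-word dictionary is made up.

```r
# Hypothetical helper for illustration only; this is NOT tm's stemCompletion().
# For each stem it returns the first dictionary word starting with that stem,
# or NA when no dictionary word matches.
complete_first <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) > 0) hits[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Made-up dictionary of complete words
dictionary <- c("company", "entity", "supply")

# A whole document passed as a single string is looked up as one long "stem",
# so nothing in the dictionary matches it.
whole_doc <- "compan entit  suppl"
whole_result <- complete_first(whole_doc, dictionary)

# Splitting into words first lets every stem be matched on its own.
tokens <- unlist(strsplit(whole_doc, " ", fixed = TRUE))
# Drop empty strings: an empty prefix matches every dictionary word.
tokens <- tokens[tokens != ""]
completed <- paste(complete_first(tokens, dictionary), collapse = " ")

print(whole_result)
print(completed)
```

Under that assumption, passing content_transformer(stemCompletion) straight to tm_map hands each document over as one string, which would explain the broken output in the question.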
Hope this helps!
Upvotes: 1