Reputation: 381
I am new to the tm package in R. I am trying to build a document-term matrix, but the function
I pass to tm_map(Corpus, function, lazy=TRUE) does not seem to be applied to the corpus.
Concretely, the documents are not converted to lower case. RStudio shows no errors or warnings.
Did I get something wrong here? Could this be an encoding issue?
library(tm)
setwd("...")
filenames <- list.files(getwd(), pattern="*.txt")
files <- lapply(filenames, readLines)
docs <- Corpus(VectorSource(files))
writeLines(as.character(docs[[30]]))
docs <- tm_map(docs, function(x) iconv(enc2utf8(x$content), sub = ""), lazy=TRUE)
# to lower case
docs <- tm_map(docs, content_transformer(tolower), lazy=TRUE)
writeLines(as.character(docs[[30]]))
Thank you for any advice!
Upvotes: 0
Views: 493
Reputation: 58
This is a simple fix. Move the lower-case conversion so that it runs before the iconv(...) step.
This works:
library(tm)
setwd("")
# Read in Files
filenames <- list.files(getwd(), pattern="\\.txt$")  # pattern is a regular expression, not a glob
files <- lapply(filenames, readLines)
docs <- Corpus(VectorSource(files))
writeLines(as.character(docs[[30]]))
# Lower Case
docs <- tm_map(docs, content_transformer(tolower), lazy=TRUE)
# Convert
docs <- tm_map(docs, function(x) iconv(enc2utf8(x$content), sub = ""))
writeLines(as.character(docs[[30]]))
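A possible reason the order matters: the function passed to tm_map in the encoding step returns a plain character vector instead of a document, so later transformations may no longer see a text document. The sketch below is one way to avoid depending on the order, by wrapping the encoding step in content_transformer() as well. It is an illustration under assumptions, not a tested replacement for the code above: the two in-memory strings in `texts` and the helper name `to_utf8` are hypothetical, and VCorpus is used explicitly in place of Corpus.

```r
library(tm)

# Hypothetical in-memory documents standing in for the .txt files
texts <- c("Hello World", "Some MIXED Case Text")
docs <- VCorpus(VectorSource(texts))

# Hypothetical helper: content_transformer() applies the function to the
# document content and returns a document, not a bare character vector
to_utf8 <- content_transformer(function(x) iconv(enc2utf8(x), sub = ""))

# Encoding clean-up first, then lower case; both steps keep the documents intact
docs <- tm_map(docs, to_utf8)
docs <- tm_map(docs, content_transformer(tolower))

# Inspect the first document
writeLines(as.character(docs[[1]]))
```

With both steps wrapped this way, either order should leave the corpus usable for DocumentTermMatrix(docs).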
Upvotes: 1