max_max_mir

Reputation: 1736

R tm prevent lower case conversion when using DocumentTermMatrix

When I build a DocumentTermMatrix from my corpus, tm lowercases the terms. I'd like to preserve the original capitalization. How do I do that?

as.matrix(DocumentTermMatrix(Corpus(VectorSource(c("Hello", "World")))))

I'd like the column names to be Hello and World instead of hello and world.

Upvotes: 2

Views: 779

Answers (2)

Azam Yahya

Reputation: 53

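The capitalize function from the Hmisc package does the job for me as a beginner. Note that it only uppercases the first letter of each term after tm has already lowercased it, so it cannot restore mixed-case terms.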

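library(tm)     # Corpus, VectorSource, DocumentTermMatrix
library(Hmisc)  # capitalize

# Build the matrix as usual; the terms come out lowercased
terms <- as.matrix(DocumentTermMatrix(Corpus(VectorSource(c("Hello", "World")))))
# Uppercase the first letter of every column name
colnames(terms) <- capitalize(colnames(terms))
terms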

    Terms
Docs Hello World
  1     1     0
  2     0     1

Upvotes: 0

Sandipan Dey

Reputation: 23109

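You can try the following hack. It simply overwrites the column names, so it only works when every document is a single distinct word:

words <- c("Hello", "World")
tdm <- as.data.frame(as.matrix(DocumentTermMatrix(Corpus(VectorSource(words)))))
names(tdm) <- sort(words) # terms come out in alphabetical order, so sort the words to match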
tdm
#  Hello World
#1     1     0
#2     0     1

A cleaner way is to switch off lowercasing through the control argument:

words <- c("Hello", "World")
dtm <- DocumentTermMatrix(Corpus(VectorSource(factor(words))),
                          control = list(tolower = FALSE))
tdm <- as.data.frame(as.matrix(dtm))
tdm
#  Hello World
#1     1     0
#2     0     1
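
The same control option can also be sketched with an explicit VCorpus, which avoids depending on which corpus class Corpus() picks for a VectorSource. This is a minimal sketch, not a definitive recipe: it assumes the tm package is installed, and the names corpus, dtm and m are just illustrative.

```r
library(tm)

words <- c("Hello", "World")

# VCorpus builds the term matrix via termFreq(), which honours `tolower`
corpus <- VCorpus(VectorSource(words))

# tolower = FALSE keeps each term exactly as it appears in the text
dtm <- DocumentTermMatrix(corpus, control = list(tolower = FALSE))

m <- as.matrix(dtm)
print(colnames(m))
```

With lowercasing disabled, "Hello" and "hello" are counted as two different terms, so this suits cases where case carries meaning.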

Upvotes: 2
