Reputation: 1355
I am pulling data from Twitter into R and I am hitting two stumbling blocks.
twit=searchTwitter("justin timerlake",n=30,lang = "en")
twit_text=sapply(twit, function(x) x$getText())
corpus=Corpus(VectorSource(twit_text))
1) How do I access the text strings in the corpus? I tried print(corpus), but it does not print the text. Instead I get this message:
print(corpus)
A corpus with 30 text documents
2) I am trying to lowercase all the text in the corpus, but I am having little success.
I tried the following commands:
tm_map(corpus, content_transformer(tolower))
Error in match.fun(FUN) : could not find function "content_transformer"
tm_map(corpus,Content(tolower))
Error in UseMethod("Content", x) :
no applicable method for 'Content' applied to an object of class "function"
tolower(twit_text)
The last one seems to stop on messages that contain weird characters such as "í ½í²™".
Upvotes: 0
Views: 161
Reputation: 6325
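To convert the corpus to lowercase, pass tolower to tm_map and assign the result back (tm_map returns a new corpus; it does not modify the original in place):
corpus = tm_map(corpus, tolower)
The "could not find function" error suggests your version of tm predates content_transformer, which was added in tm 0.6. On tm 0.6 and later, use tm_map(corpus, content_transformer(tolower)) instead.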
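If tolower() stops on tweets with odd bytes, as described in the question, one common workaround is to strip the bytes that cannot be converted before lowercasing. The following is a minimal sketch using only base R; the sample vector is made up for illustration, and it assumes that dropping non-ASCII characters (emoji, accented letters) is acceptable for your analysis.

```r
# Hypothetical sample standing in for twit_text; the second element
# contains bytes that are not valid UTF-8.
twit_text <- c("Hello WORLD", "Bad bytes: \xed\xa0\xbd here")

# Convert to ASCII, replacing anything that cannot be converted with ""
# (sub = "" drops the offending bytes instead of returning NA).
clean_text <- iconv(twit_text, from = "UTF-8", to = "ASCII", sub = "")

# tolower() now only sees plain ASCII strings.
lower_text <- tolower(clean_text)
print(lower_text)
```

The cleaned vector can then be passed to Corpus(VectorSource(lower_text)) as in the question. The exact handling of invalid bytes can vary slightly between platforms, so check the output on a few of your own tweets.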
You can get at the terms in the corpus by converting it to a Document-Term Matrix (DTM), which holds the term counts per document:
dtm <- DocumentTermMatrix(corpus)
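To look at the raw text of individual documents rather than term counts, tm also provides inspect() and per-document indexing. This is a small sketch with a made-up two-document corpus; the as.character() call assumes tm 0.6 or later (on older versions, printing corpus[[1]] shows the text directly).

```r
library(tm)

# Hypothetical corpus standing in for the tweets in the question.
corpus <- Corpus(VectorSource(c("First tweet", "Second tweet")))

# Print a summary of the corpus together with its documents.
inspect(corpus)

# Extract the text of a single document as a plain character string
# (assumes tm >= 0.6).
first_doc <- as.character(corpus[[1]])
print(first_doc)
```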
Edit
Typical text cleaning functions:
corpus <- tm_map(corpus, tolower)            # lowercase all text
corpus <- tm_map(corpus, removePunctuation)  # strip punctuation
corpus <- tm_map(corpus, removeNumbers)      # strip digits
corpus <- tm_map(corpus, PlainTextDocument)  # convert back to text documents
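Because the right way to call tolower depends on the installed tm version, the cleaning steps can be written so they pick the appropriate form at run time. This is only a sketch under that assumption; the sample corpus is invented, and the version check relies on content_transformer having been introduced in tm 0.6.

```r
library(tm)

# Hypothetical sample corpus for illustration.
corpus <- Corpus(VectorSource(c("Hello WORLD 123!", "Another TWEET...")))

# On tm >= 0.6, base functions such as tolower must be wrapped in
# content_transformer(); older versions accept them directly.
if (packageVersion("tm") >= "0.6") {
  corpus <- tm_map(corpus, content_transformer(tolower))
} else {
  corpus <- tm_map(corpus, tolower)
}

# removePunctuation and removeNumbers are tm transformations and can be
# passed to tm_map as-is.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

inspect(corpus)
```

Newer tm versions may emit a warning about dropped documents during tm_map on a corpus built with Corpus(VectorSource(...)); the transformed text is still returned.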
Upvotes: 1