jessica

Reputation: 1355

Scraping Twitter Data in R

I am pulling data from Twitter into R and have hit two stumbling blocks.

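library(twitteR)  # provides searchTwitter()
library(tm)       # provides Corpus() and VectorSource()
twit = searchTwitter("justin timerlake", n = 30, lang = "en")
twit_text = sapply(twit, function(x) x$getText())  # extract the text of each tweet
corpus = Corpus(VectorSource(twit_text))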

1) How do I access the text of the documents in the corpus? I tried print(corpus), but it does not print the text. Instead I get this message:

print(corpus)
A corpus with 30 text documents

2) I am trying to lowercase all the text in the corpus but I am having little success.

I tried the following commands:

tm_map(corpus, content_transformer(tolower))
Error in match.fun(FUN) : could not find function "content_transformer"

tm_map(corpus,Content(tolower))
Error in UseMethod("Content", x) : 
  no applicable method for 'Content' applied to an object of class "function"

tolower(twit_text) 

The last one seems to stop on messages that contain unusual characters, such as "í ½í²™".

Upvotes: 0

Views: 161

Answers (1)

amrrs

Reputation: 6325

To convert the corpus to lowercase:

corpus = tm_map(corpus, tolower)
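
The "could not find function" error in the question suggests an older tm release, where content_transformer is not available. If you upgrade, a hedged sketch for newer tm (0.6 or later, to my knowledge) would wrap tolower so the documents stay documents. The sample vector `docs` below is a made-up stand-in for twit_text, not data from the question.

```r
library(tm)  # assumes tm 0.6 or later, where content_transformer() exists

# Hypothetical sample text standing in for twit_text
docs <- c("Hello WORLD", "Another TWEET")
corpus <- VCorpus(VectorSource(docs))

# Wrap tolower so tm_map returns documents rather than bare character vectors
corpus <- tm_map(corpus, content_transformer(tolower))

# Pull the text of the first document back out as a plain string
first_doc <- as.character(corpus[[1]])
```

On older versions the plain tm_map(corpus, tolower) call above should be enough.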

You can access the terms in the corpus by converting it to a Document Term Matrix (DTM):

dtm <- DocumentTermMatrix(corpus)
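
If you want the raw strings rather than term counts, a minimal sketch is to use inspect() or index a single document. The `docs` vector here is a hypothetical stand-in for twit_text, and the exact printed layout depends on your tm version.

```r
library(tm)

# Hypothetical sample text standing in for twit_text
docs <- c("First tweet text", "Second tweet text")
corpus <- Corpus(VectorSource(docs))

# Print the corpus summary together with the document contents
inspect(corpus)

# Extract one document's text as a plain character string
first_doc <- as.character(corpus[[1]])
```

print(corpus) only shows the summary line, which is why the question saw "A corpus with 30 text documents" and no text.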

Edit

Typical Text Cleaning Functions:

corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, PlainTextDocument)
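
For the tolower(twit_text) call that stops on odd characters, one possible workaround (an assumption on my part, since I cannot see the actual tweets) is to drop bytes that do not convert to ASCII with base R's iconv before lowercasing. The `sample_text` vector is invented for illustration.

```r
# Hypothetical sample standing in for twit_text; the second string
# contains the raw bytes of an emoji, similar to what tweets often carry
sample_text <- c("Hello WORLD", "Love this \xf0\x9f\x92\x99 SONG")

# Convert to ASCII and drop any byte that cannot be represented
clean_text <- iconv(sample_text, from = "UTF-8", to = "ASCII", sub = "")

# tolower now only sees plain ASCII input
lower_text <- tolower(clean_text)
```

This discards emoji and accented letters entirely, so it only fits if you do not need them in the analysis. You could then build the corpus from lower_text instead of twit_text.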

Upvotes: 1
