Hana

Reputation: 157

Analyzing Twitter data using R

I am trying to analyze Twitter data in R by plotting the number of tweets over a period of time. When I run

plot(tweet_df$created_at, tweet_df$text)

I got this error message:

Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf

Here is the code which I used:

library("rjson")
json_file <- "tweet.json"
json_data <- fromJSON(file=json_file)
library("streamR")
tweet_df <- parseTweets(tweets = json_file)
#using the twitter data frame
tweet_df$created_at
tweet_df$text
plot(tweet_df$created_at, tweet_df$text) 

Upvotes: 1

Views: 1152

Answers (1)

jed

Reputation: 615

You've got a couple of issues here, but nothing insurmountable. As far as I can tell, plot() fails because both columns are character vectors: R tries to coerce them to numbers, gets only NAs, and is left with no finite range for the axes. If you want to track tweets over time, what you're really asking for is the number of tweets created per unit of time (tweets per second, per minute, whatever). That means you only need the created_at column, and you can build the graph with R's hist function.

If you want to split by words mentioned in the text or something similar, that's doable too, but you should probably use ggplot2 for it and maybe ask a separate question. Anyway, it looks like parseTweets leaves Twitter's timestamps as a character field, so you'll want to turn them into a POSIXct timestamp field that R can understand. Assuming you have a data frame that looks something like this:

❥ head(tweet_df[,c("id_str","created_at")])
              id_str                     created_at
1 597862782101561346 Mon May 11 20:36:09 +0000 2015
2 597862782097346560 Mon May 11 20:36:09 +0000 2015
3 597862782105694208 Mon May 11 20:36:09 +0000 2015
4 597862782105694210 Mon May 11 20:36:09 +0000 2015
5 597862782076198912 Mon May 11 20:36:09 +0000 2015
6 597862782114078720 Mon May 11 20:36:09 +0000 2015

You can do that like this:

❥ dated_tweets <- as.POSIXct(tweet_df$created_at, format = "%a %b %d %H:%M:%S +0000 %Y", tz = "UTC")
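
Before running this on the whole data frame, it can help to try the format string on a couple of sample values. A minimal sketch, using two made-up timestamps in the same layout as above; sample_times and parsed are names I picked for illustration, and tz = "UTC" is an assumption based on the +0000 offset. Note that %a and %b are matched against the current locale's day and month names, so this assumes an English locale:

```r
# Two sample timestamps in the layout shown above (made up for illustration)
sample_times <- c("Mon May 11 20:36:09 +0000 2015",
                  "Mon May 11 20:37:41 +0000 2015")

# "+0000" in the format string is matched as literal text, not parsed as an
# offset, so the time zone is set explicitly (assumed to be UTC)
parsed <- as.POSIXct(sample_times,
                     format = "%a %b %d %H:%M:%S +0000 %Y",
                     tz = "UTC")

# TRUE here would mean the format string did not match some value
anyNA(parsed)

# Gap between the two sample tweets, in seconds
difftime(parsed[2], parsed[1], units = "secs")
```

If anyNA() returns TRUE on your real data, the format string or the locale is the first thing to check.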

dated_tweets is now a vector of tweet timestamps in R's date-time format, which you can pass straight to hist. I left the sample Twitter stream open for 15 minutes or so, and this is the result:

❥ hist(dated_tweets, breaks ="secs", freq = TRUE)

[Image: histogram of tweet counts per second]
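
If per-second bars are too noisy, the same timestamps can be counted per minute instead. A rough sketch with made-up timestamps standing in for dated_tweets; cut() and table() are base R, and dated_example and per_minute are just illustrative names:

```r
# Made-up timestamps standing in for dated_tweets
dated_example <- as.POSIXct(c("2015-05-11 20:36:09", "2015-05-11 20:36:40",
                              "2015-05-11 20:37:05", "2015-05-11 20:39:30"),
                            tz = "UTC")

# Assign each timestamp to the minute it falls in, then count tweets per minute
per_minute <- table(cut(dated_example, breaks = "min"))
per_minute

# Bar chart of the per-minute counts
barplot(per_minute, las = 2, ylab = "Tweets per minute")
```

Swapping breaks = "min" for "hour" or "day" should give coarser bins in the same way.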

Upvotes: 3
