Anna Yashina

Reputation: 534

Random sample of tweets of a time period using TwitteR

I need as many tweets as possible for a given hashtag over a two-day period. The problem is that there are too many of them (I guess ~1 million) to extract using just a time-period specification:

  1. It would definitely take a lot of time if I specify something like retryOnRateLimit = 120
  2. If I don't, I'll get blocked soon and end up with tweets for only half a day

The obvious answer for me is to extract a random sample with the given parameters, but I can't figure out how to do it.

My code is here:

a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)

The last try stopped at 17.5 thousand tweets, which covers only the past 12 hours.

P.S. It may be useful to exclude retweets, but I don't know how to specify that within searchTwitteR().

Upvotes: 4

Views: 1439

Answers (1)

mkearney

Reputation: 1335

The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.

Technically, you could request all 1 million straight away using search_tweets() from the rtweet package. However, I recommend breaking the job into pieces, since collecting 200000 tweets will take several hours.

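library(rtweet)
## start with no upper bound on status IDs
maxid <- NULL
rt <- vector("list", 5)
for (i in seq_len(5)) {
    ## request 200000 tweets per pass, waiting out rate limits
    rt[[i]] <- search_tweets("hashtag", n = 200000,
                             retryonratelimit = TRUE,
                             max_id = maxid)
    ## the oldest status ID becomes the upper bound for the next pass
    maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}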
## extract users data and combine into data frame
users <- do.call("rbind", lapply(rt, users_data))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
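
As for the random sample and the retweets mentioned in the question: search_tweets() has an include_rts argument that can be set to FALSE, and once the tweets are collected a sample can be drawn with base R. Below is a minimal sketch of that post-collection step. The helper name sample_tweets() is hypothetical (it is not part of rtweet), and it assumes the data frame has a logical is_retweet column, which may not be the case in every rtweet version. Note that this samples only from the tweets actually collected, not from everything posted in the time window.

```r
## Hypothetical helper (not part of rtweet): optionally drop retweets,
## then draw a simple random sample of at most `size` rows.
sample_tweets <- function(tweets, size, drop_retweets = TRUE, seed = NULL) {
    ## fix the seed only when one is supplied, so results can be reproduced
    if (!is.null(seed)) set.seed(seed)
    ## assumption: retweets are flagged in a logical `is_retweet` column
    if (drop_retweets && "is_retweet" %in% names(tweets)) {
        keep <- !(tweets$is_retweet %in% TRUE)
        tweets <- tweets[keep, , drop = FALSE]
    }
    ## never ask for more rows than are available
    size <- min(size, nrow(tweets))
    tweets[sample.int(nrow(tweets), size), , drop = FALSE]
}

## toy data standing in for the combined `rt` data frame
toy <- data.frame(
    status_id = as.character(1:10),
    is_retweet = rep(c(FALSE, TRUE), 5),
    stringsAsFactors = FALSE
)
smp <- sample_tweets(toy, size = 3, seed = 42)
```

With real data, the call would look like sample_tweets(rt, size = 10000), applied after the list of results has been collapsed into one data frame.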

Upvotes: 2
