I have a list of tweet ids (tids.csv) for which I need to collect ALL retweets. Since the v2 API can't directly retrieve the retweets of a specific tweet, I had to get all retweets of each user (the authors of the tweets in my tids.csv file) and then keep only the retweets of the specific tweets I'm interested in. Here is the command I used to get all retweets of the users:
while read line; do twarc2 search --archive --start-time "..." --end-time "..." "retweets_of:$line"; done < usernames.txt > usersRetweets.jsonl
where usernames.txt is a text file containing a list of usernames, one per line.
Note: I'm using the v2 API because the v1 API only returns the 100 most recent retweets, but I need all retweets.
PROBLEM:
There are ~17 thousand unique usernames in my input text file, and collecting all retweets for that many users takes a long time and a lot of disk space. For example, it took a couple of days and ~5 GB of space to collect retweets for only ~600 users.
QUESTION: What is the most efficient way to get all retweets in terms of time and space?
The above command retrieves all retweets of all tweets for each user, while I only need all retweets of the specific tweets in my dataset (tids.csv).
POSSIBLE SOLUTION: Here is what I’m thinking to do:
For each user in usernames.txt: collect the retweets of that user, keep only the retweets of tweets listed in the tids.csv file, and remove the rest. (This can be done by matching the id in the referenced_tweets field with the ids in the tids.csv file.)
How can I do it in Python (using twarc2 as a library)?
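To make the idea concrete, here is a minimal sketch of the filtering step, not a tested solution. It assumes tids.csv holds one tweet id per row in its first column, and that each retweet carries a referenced_tweets entry of type "retweeted". The helper names (load_target_ids, filter_retweets, collect) are made up for illustration. The collect function relies on twarc's Twarc2.search_all and ensure_flattened, and assumes start_time and end_time are passed as datetime objects. Note that this only saves disk space: the same search requests are still made for every user.

```python
import csv
import json


def load_target_ids(path):
    """Read tweet ids from the first column of a CSV file (assumed layout)."""
    with open(path, newline="") as f:
        # Keeping only numeric cells also skips a header row, if there is one.
        return {
            row[0].strip()
            for row in csv.reader(f)
            if row and row[0].strip().isdigit()
        }


def is_retweet_of_target(tweet, target_ids):
    """True if the tweet retweets one of the tweets in target_ids."""
    for ref in tweet.get("referenced_tweets") or []:
        if ref.get("type") == "retweeted" and str(ref.get("id")) in target_ids:
            return True
    return False


def filter_retweets(tweets, target_ids):
    """Yield only the retweets of the target tweets."""
    for tweet in tweets:
        if is_retweet_of_target(tweet, target_ids):
            yield tweet


def collect(usernames, target_ids, out_path, bearer_token, start_time, end_time):
    """Search retweets per user and write only the matching ones to a JSONL file."""
    # Imported here so the helpers above can be used without twarc installed.
    from twarc import Twarc2
    from twarc.expansions import ensure_flattened

    client = Twarc2(bearer_token=bearer_token)
    with open(out_path, "w") as out:
        for username in usernames:
            pages = client.search_all(
                f"retweets_of:{username}",
                start_time=start_time,
                end_time=end_time,
            )
            for page in pages:
                # ensure_flattened turns a response page into a list of tweets.
                for tweet in filter_retweets(ensure_flattened(page), target_ids):
                    out.write(json.dumps(tweet) + "\n")
```

Because matching tweets are written as soon as each page arrives, the unfiltered retweets never need to be stored on disk.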