What is the most efficient way to get all retweets for a list of tweet ids using Twitter API V2

Question

I have a list of tweet ids (tids.csv) for which I need to collect ALL retweets. Since I can't find a way to directly retrieve all retweets of a specific tweet in the v2 API, I get all retweets of each user (the authors of the tweets in my tids.csv file) and then filter for the retweets of the specific tweets I'm interested in. Here is the command I used to get all retweets of the users:

while read line; do twarc2 search --archive --start-time "..." --end-time "..." "retweets_of:$line"; done < usernames.txt > usersRetweets.jsonl

where usernames.txt is a text file containing one username per line.

Note: I'm using the v2 API because the v1.1 API only returns the 100 most recent retweets of a tweet, and I need all of them.

PROBLEM:

There are ~17 thousand unique usernames in my input text file, and collecting all retweets for that many users takes a long time and a lot of disk space. For example, it took a couple of days and ~5 GB to collect the retweets for only ~600 users.

QUESTION: What is the most efficient way to get all retweets in terms of time and space?

The above command retrieves all retweets of all tweets for each user, while I only need all retweets of the specific tweets in my dataset (tids.csv).

POSSIBLE SOLUTION: Here is what I'm thinking of doing:

For each user in usernames.txt:

Retrieve all retweets of a user (setting the start time and end time to the time range I'm interested in)
Keep only the retweets whose source tweet id is found in tids.csv and discard the rest. This can be done by matching the id in the referenced_tweets field against the ids in tids.csv.
For each remaining retweet, extract only the fields I want in the JSON from the dictionary.
Write the dictionary into a file

How can I do it in Python (using twarc2 as a library)?
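
The filtering part of the steps above (everything except the retrieval itself) might look roughly like the sketch below. It is only a starting point under a few assumptions: tids.csv holds the tweet ids in its first column, each tweet is a dict shaped like a v2 tweet object with a referenced_tweets list of {"type", "id"} entries, and the function names and the kept fields (id, author_id, created_at) are hypothetical placeholders for whatever is actually needed.

```python
import csv
import json


def load_tweet_ids(path):
    """Read the wanted tweet ids into a set of strings.

    Assumes the ids sit in the first column; non-numeric cells
    (for example a header row) are skipped.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {
            row[0].strip()
            for row in csv.reader(f)
            if row and row[0].strip().isdigit()
        }


def retweeted_id(tweet):
    """Return the id of the tweet that `tweet` retweets, or None."""
    for ref in tweet.get("referenced_tweets", []):
        if ref.get("type") == "retweeted":
            return str(ref.get("id"))
    return None


def filter_retweets(tweets, wanted_ids, fields=("id", "author_id", "created_at")):
    """Yield slimmed-down dicts for retweets whose source id is in wanted_ids."""
    for tweet in tweets:
        source_id = retweeted_id(tweet)
        if source_id in wanted_ids:
            slim = {key: tweet.get(key) for key in fields}
            slim["retweeted_id"] = source_id
            yield slim


def write_jsonl(records, path):
    """Append each record to `path` as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
```

Because filter_retweets is a generator, each page of results could be filtered and written as soon as it arrives, so only the matching retweets with the selected fields ever reach the disk. As far as I understand, the pages would come from twarc's Twarc2.search_all and be flattened with twarc.expansions.ensure_flattened before being passed in, but I have not verified that part.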
