Reputation: 1462
I have a large database of Twitter users (about 6 million). For each user I have the user ID, login handle, recent tweets, contact details, location, etc.
I want to build a user-follower list out of these. Basically, I want to create another table with two columns: 1) User-ID (the ID of a user I already have) and 2) Follower-ID (the IDs of all followers of this user, separated by semicolons).
For example, if the user with ID 001 is followed by the users with IDs 002 and 003, the record would look like this -
User-ID - 001 Follower-ID - 002;003
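A minimal sketch of how such a record could be assembled in Java, assuming the follower IDs are already available as a list of strings. The class and method names (`FollowerRecord`, `joinFollowers`) are made up for illustration and are not part of any library.

```java
import java.util.Arrays;
import java.util.List;

public class FollowerRecord {
    // Hypothetical helper: joins follower IDs into the semicolon-separated
    // form used in the example record above.
    static String joinFollowers(List<String> followerIds) {
        return String.join(";", followerIds);
    }

    public static void main(String[] args) {
        // Sample IDs taken from the example in the question.
        List<String> followers = Arrays.asList("002", "003");
        System.out.println("User-ID - 001 Follower-ID - " + joinFollowers(followers));
    }
}
```

Keeping the IDs as strings preserves leading zeros such as those in 001.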
I would prefer to do this in Java, but I am open to other languages as well.
I tried twitter4j, a Java library for fetching tweets, users, etc., but it is limited in the number of API calls per day. Using the Twitter Search or REST API does not seem possible, since it does not give me the IDs of a particular user's followers.
My professor suggested another way: crawl Twitter's web pages. For example, if a user's handle is xxx, then I need to crawl the following link -
https://twitter.com/xxx/followers
I would fetch this page and parse the HTML to get the follower IDs. I checked the page using Firebug and could see the IDs of all the followers!
The problem is: how do I do this for the 6 million users that I have? (I have the handles, so I just need to crawl the link mentioned above, replacing xxx with the next handle.)
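As a rough sketch of the "replace xxx with the next handle" step, assuming the handles have already been loaded from the database into a list, the URLs could be generated like this. The names `FollowerUrls`, `followersUrl` and `followersUrls` are hypothetical, and the sample handles are placeholders. This only builds the URLs; it does not fetch or parse anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FollowerUrls {
    // Builds the followers-page URL for one handle, following the pattern
    // given in the question: https://twitter.com/<handle>/followers
    static String followersUrl(String handle) {
        return "https://twitter.com/" + handle + "/followers";
    }

    // Builds one URL per handle, preserving the input order.
    static List<String> followersUrls(List<String> handles) {
        List<String> urls = new ArrayList<>();
        for (String handle : handles) {
            urls.add(followersUrl(handle));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder handles; in practice these would come from the user table.
        for (String url : followersUrls(Arrays.asList("xxx", "yyy"))) {
            System.out.println(url);
        }
    }
}
```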
I tried using crawler4j, a web crawler, to crawl the Twitter pages, but since Twitter has increased its security, this does not work either.
How can I do this? I am doing this as part of my research project and I am really stuck here.
I want to find a way to crawl Twitter web pages to get the required information.
Please help!
Upvotes: 4
Views: 4545
Reputation: 787
I would start with the links below. It can be done, but it is going to take a considerable amount of time.
https://dev.twitter.com/docs/api/1.1/get/followers/ids
https://dev.twitter.com/docs/api/1.1/get/friends/ids
Consider that Justin Bieber has 40,000,000 followers, so pulling them with one token would take about 5 1/2 days.
40,000,000 (followers) / 5,000 (records returned per call) / 15 (max REST calls in a 15-minute period) / 4 (15-minute intervals in 1 hour) ≈ 133 hours
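The same estimate can be written as a small Java sketch, which makes it easy to plug in other follower counts. The figures (5,000 IDs per call, 15 calls per 15-minute window) are the ones quoted above; the class and method names are made up for illustration.

```java
public class FollowerFetchEstimate {
    // Estimates the hours needed to page through all follower IDs of one user
    // with a single token, given the rate limit figures quoted above.
    static double hoursToFetch(long followers, int idsPerCall,
                               int callsPerWindow, int windowMinutes) {
        long calls = (followers + idsPerCall - 1) / idsPerCall; // ceiling division
        double windows = (double) calls / callsPerWindow;       // rate-limit windows needed
        return windows * windowMinutes / 60.0;                  // minutes converted to hours
    }

    public static void main(String[] args) {
        double hours = hoursToFetch(40_000_000L, 5_000, 15, 15);
        System.out.printf("%.1f hours (about %.1f days)%n", hours, hours / 24);
    }
}
```

With the numbers above this comes to 8,000 calls, or roughly 133 hours. Most accounts have far fewer followers, so the per-user cost is usually a single call, but across 6 million users the rate limit still dominates the total time.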
Upvotes: 3