Kedar Joshi

Reputation: 1462

Crawl Twitter Users and Followers Data

I have a large database of Twitter users (about 6 million). For each user I have the user ID, login handle, recent tweets, contact details, location, etc.

I want to build a user-follower list out of these. Basically, I want to create another table with two columns: 1) User-ID (the ID of a user I already have) and 2) Follower-ID (the IDs of all followers of this user, separated by semicolons).

For example, if a user with ID 001 is being followed by users with IDs 002 and 003, the record would look like this:

User-ID - 001 Follower-ID - 002;003
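
To make the target format concrete, here is a minimal Java sketch that builds such a record. The class and method names (`FollowerRecord`, `followerColumn`, `describe`) are made up for illustration, and IDs are kept as strings on the assumption that leading zeros such as `001` should be preserved.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: names are illustrative, not from any library.
class FollowerRecord {

    // Joins follower IDs with ';' to form the Follower-ID column value.
    static String followerColumn(List<String> followerIds) {
        return String.join(";", followerIds);
    }

    // Renders one record in the "User-ID - ... Follower-ID - ..." layout.
    static String describe(String userId, List<String> followerIds) {
        return "User-ID - " + userId + " Follower-ID - " + followerColumn(followerIds);
    }

    public static void main(String[] args) {
        // prints: User-ID - 001 Follower-ID - 002;003
        System.out.println(describe("001", Arrays.asList("002", "003")));
    }
}
```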

I would prefer to do this in Java, but I am open to other languages as well.

I tried using twitter4j, a Java library for fetching tweets, users, etc., but it is limited in the number of API calls it can make per day. Using the Twitter Search or REST API is not possible, since it does not give me the IDs of the followers of a particular user.

My professor suggested another approach: crawling Twitter's web pages. For example, if a user's handle is xxx, then I need to crawl the following link:

https://twitter.com/xxx/followers

Fetch this page and parse the HTML to get the follower IDs. I checked the page using Firebug and I could see the IDs of all the followers!

The problem is: how do I do this for the 6 million users that I have? (I have the handles, so I just need to crawl the link mentioned above, replacing xxx with the next handle.)

I tried using crawler4j, a web crawler, to crawl Twitter pages, but since Twitter has tightened its security, this is not possible either.

How can I do this? I am doing this as part of my research project and I am really stuck. I am looking for a way to crawl Twitter web pages to get the required information.

Upvotes: 4

Views: 4545

Answers (1)

tjrburgess

Reputation: 787

I would start with the links below. It can be done, but it is going to take a considerable amount of time.

https://dev.twitter.com/docs/api/1.1/get/followers/ids

https://dev.twitter.com/docs/api/1.1/get/friends/ids

Consider that Justin Bieber has 40,000,000 followers, so pulling them all with one token would take about 5 1/2 days:

40,000,000 (followers) / 5,000 (IDs returned per call) / 15 (max REST calls per 15-minute window) / 4 (15-minute windows per hour) ≈ 133 hours
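
The same estimate can be reproduced for any account with a small Java sketch. The class and method names are hypothetical. The limits (5,000 IDs per call, 15 calls per 15-minute window) are the figures quoted above, passed in as parameters so they can be changed if the API limits differ.

```java
import java.util.Locale;

// Hypothetical helper for the back-of-the-envelope estimate above.
class FollowerPullEstimate {

    // Hours needed to page through all follower IDs with a single token.
    static double hoursToPull(long followers, int idsPerCall,
                              int callsPerWindow, int windowMinutes) {
        // Number of API calls, rounded up to cover a partial last page.
        long calls = (followers + idsPerCall - 1) / idsPerCall;
        // Number of rate-limit windows those calls occupy.
        double windows = (double) calls / callsPerWindow;
        return windows * windowMinutes / 60.0;
    }

    public static void main(String[] args) {
        double hours = hoursToPull(40_000_000L, 5_000, 15, 15);
        // prints: 133.3 hours (5.6 days)
        System.out.println(String.format(Locale.ROOT,
                "%.1f hours (%.1f days)", hours, hours / 24.0));
    }
}
```

Applying the same function across all 6 million users in the question gives a rough idea of how many tokens or how much wall-clock time the full crawl would need.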

Upvotes: 3
