Reputation: 1061
I have a list of more than 500K Twitter usernames. I was able to develop a program using twython and the API secret keys. The program and inputs are too large to post here, so I have uploaded them to GitHub.
The program runs fine for around 150 usernames, but not more than that. This limitation makes it impossible to scrape the geolocations for the 500K+ usernames.
I am looking for help with bypassing the API, perhaps by using a web scraping technique or any other, better alternative for scraping the geolocations of these usernames.
Any help is appreciated :)
Upvotes: 0
Views: 272
Reputation: 10431
What I would do is scrape twitter.com instead of using the Twitter API.
The main reason is that the frontend is not query limited (or at least far less limited), and even if you need to hit Twitter many times per second, you can rotate the User-Agent and proxies so you don't get spotted (see the sketch below).
So for me, scraping is the easiest way to bypass the API limit.
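For example, a minimal sketch of that rotation with urllib2 could look like this; the USER_AGENTS and PROXIES lists are placeholders you would have to fill in for your own setup:

import random
import urllib2

# Placeholder pools -- replace with real values for your setup.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4)',
]
PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']

def fetch(url):
    # Pick a random proxy and User-Agent for every request so the traffic
    # doesn't look like a single client hammering Twitter.
    opener = urllib2.build_opener(
        urllib2.ProxyHandler({'http': random.choice(PROXIES)}))
    request = urllib2.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
    return opener.open(request).read()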
Moreover, what you need to crawl is really easy to access. I made a simple'n'dirty script that parses your CSV file and outputs the location of each user.
I will make a PR on your repo for fun, but here is the code:
#!/usr/bin/env python
import urllib2

from bs4 import BeautifulSoup

with open('00_Trump_05_May_2016.csv', 'r') as csv:
    next(csv)  # skip the header row
    for line in csv:
        line = line.strip()
        permalink = line.split(',')[-1].strip()
        username = line.split(',')[0]
        userid = permalink.split('/')[3]
        page_url = 'http://twitter.com/{0}'.format(userid)
        try:
            page = urllib2.urlopen(page_url)
        except urllib2.HTTPError:
            print 'ERROR: username {} not found'.format(username)
            continue  # skip this user instead of reusing the previous page
        content = page.read()
        html = BeautifulSoup(content, 'html.parser')
        location = html.select('.ProfileHeaderCard-locationText')[0].text.strip()
        print 'username {0} ({1}) located in {2}'.format(username, userid, location)
Output:
username cenkuygur (cenkuygur) located in Los Angeles
username ilovetrumptards (ilovetrumptards) located in
username MorganCarlston hanifzk (MorganCarlston) located in
username mitchellvii (mitchellvii) located in Charlotte, NC
username MissConception0 (MissConception0) located in #UniteBlue in Semi-Red State
username HalloweenBlogs (HalloweenBlogs) located in Los Angeles, California
username bengreenman (bengreenman) located in Fiction and Non-Fiction Both
...
Obviously you should update this code to make it more robust, but the basics are there.
PS: I parse the 'permalink' field because it stores a well-formatted slug that can be used to reach the profile page. It's pretty dirty, but it's quick and it works.
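If you want a starting point for the robustness part, here is a sketch (the max_tries and delay values are arbitrary) that retries transient failures and guards against profiles with no location set:

import time
import urllib2

def fetch_page(url, max_tries=3, delay=2):
    # Retry a few times, sleeping between attempts, so a transient network
    # error or throttling doesn't kill the whole 500K+ run.
    for attempt in range(max_tries):
        try:
            return urllib2.urlopen(url).read()
        except urllib2.HTTPError:
            raise  # 404 and friends: the profile really isn't there
        except urllib2.URLError:
            time.sleep(delay)
    return None

def extract_location(html):
    # Some profiles have no location set, so don't assume the node exists.
    nodes = html.select('.ProfileHeaderCard-locationText')
    return nodes[0].text.strip() if nodes else ''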
About the Google API, I would definitely use some kind of cache/database to avoid making too many Google calls.
In Python, without a database, you can just keep a dict like:
{
    "San Francisco": [x.y, z.a],
    "Paris": [b.c, d.e],
}
And for each location to parse, I would first check whether the key already exists in this dict; if it does, just take the value from there, otherwise call the Google API and then save the result into the dict (a sketch follows below).
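A rough sketch of that lookup, assuming geopy's GoogleV3 geocoder (any geocoding client would work the same way, and the API key is a placeholder):

from geopy.geocoders import GoogleV3

geocoder = GoogleV3(api_key='YOUR_GOOGLE_API_KEY')  # placeholder key
location_cache = {}  # e.g. {"San Francisco": (37.77, -122.41)}

def lookup(location):
    # Serve repeated location strings from the dict so each distinct
    # string hits the Google API only once.
    if location in location_cache:
        return location_cache[location]
    result = geocoder.geocode(location)
    coords = (result.latitude, result.longitude) if result else None
    location_cache[location] = coords
    return coords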
I think with these two approaches you will be able to get your data.
Upvotes: 2