Reputation: 627
I am trying to create a script (purely for learning purposes) to translate a given word with a few different dictionaries. I worked my way through two, using urllib2 and BeautifulSoup to get and parse the translations, then moved on to Google Translate.
I quickly found that it returns a 403 Forbidden error. Adding a user agent gets the translation, but only a one-word translation. To illustrate, go to http://translate.google.com/?text=test&sl=en&tl=es and you will get both the translation (in an element with class 'hps') and a list of verbs, nouns and adjectives. But use the script below and the HTML is different: only the main translation is returned, in a span with id=result_box. The verbs, nouns etc. are nowhere to be found.
During the process, and after quite a bit of googling, I became aware that there is now an API, and not a free one at that. I do not intend to post any eventual script, nor to use it to violate any TOS, but I am now mostly intrigued as to why the browser and urllib etc. get different results.
To that effect I have tried pure urllib2 with user agents, and mechanize, as below. So, my question is: aside from the user agent, what else differentiates the browser from the Python script? I have tried using Firebug, but nothing jumped out at me (albeit I'm a noob with it). Thanks!
Edit: my script, the request headers from Firebug, and the script's debug output are below.
import mechanize
import re
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://translate.google.com/?text=test&sl=en&tl=es')
html = r.read()
match = re.findall(r'verb', html)
print match
Firebug:
GET /?text=test&sl=en&tl=es HTTP/1.1
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding gzip, deflate
Accept-Language en-us,en;q=0.5
Connection keep-alive
Cookie PREF=ID=298b435815ef8553:U=e7dad4baf65f083b:FF=0:LD=en:CR=2:TM=1327516863:LM=1339428154:S=maktYFZEHXXpMDFg; NID=60=U229h4lzOnjpHyidbhgYecCx72Myp_-XHgupW-R_mWtpuOveDdIOO1uLBq-6ltn-ER15ppJryR7yYOYEhkCfUCl45qNz5aymBQ1CGDHS4UcHu2oIDYAHut0ctnlL76eDW3n7kjOWoz5wNH6NMw
Host translate.google.com
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20100101 Firefox/9.0
Script:
send: 'GET /?text=test&sl=en&tl=es HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: translate.google.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Mon, 11 Jun 2012 16:13:42 GMT
header: Expires: Fri, 01 Jan 1990 00:00:00 GMT
header: Cache-Control: no-cache, must-revalidate
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Type: text/html; charset=UTF-8
header: Content-Language: en
header: Set-Cookie: PREF=ID=6dd42f2264250d7c:TM=1333431222:LM=1339454222:S=k6JXSoGGaAMNmPEo; expires=Wed, 11-Jun-2014 16:13:42 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=60=f8czmR413h3sKUGJUUM4PLKl2O7SUtqfW5hss5O54sRKoErf9wIEU4Wu2WCuHzWTJQ3p1Rj7dQv1B4BBmSMY1tmfus7UZGCYFIKaXoKwklZ9tZsr5vds8vvvFjRdZyevn; expires=Tue, 11-Dec-2012 16:13:42 GMT; path=/; domain=.google.com; HttpOnly
header: P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
header: X-Content-Type-Options: nosniff
header: Server: HTTP server (unknown)
header: X-XSS-Protection: 1; mode=block
header: Connection: close
Upvotes: 1
Views: 632
Reputation: 44444
The verbs and adjectives are not found because they are loaded via an AJAX call. Your mechanize browser does not run JavaScript, so it cannot make any AJAX requests. However, if you look at the network requests in your browser's inspector, you will see the URL, headers, and parameters of that call. All that remains is to imitate the call.
I curl'ed it and got a JSON response:
thrustmaster@thrustmaster:~/Temp$ curl 'http://translate.google.com/translate_a/t?client=t&text=test&hl=en&sl=en&tl=es&multires=1&ssel=0&tsel=0&sc=1' -H 'User-Agent: blah'
[[["prueba","test","",""]],[["noun",["prueba","ensayo","test","examen","análisis","criterio","toque","ejercicio","tanteo"],[["prueba",["test","proof","evidence","trial","event","race"]],["ensayo",["test","trial","essay","assay","testing","rehearsal"]],["test",["test"]],["examen",["examination","review","exam","test","inspection","quiz"]],["análisis",["analysis","test","review","assay","breakdown"]],["criterio",["criterion","judgment","standard","test","view","yardstick"]],["toque",["touch","stroke","test","knock","blast","chime"]],["ejercicio",["exercise","practice","drill","practicing","test","prosecution"]],["tanteo",["score","scoring","trial","test","try","calculation"]]]],["adjective",["de prueba"],[["de prueba",["test","testing","trial","probationary","corrective"]]]],["verb",["probar","comprobar","ensayar","examinar","poner a prueba","experimentar","someter a prueba","interrogar","hacer investigaciones","justificar","graduar"],[["probar",["test","try","prove","taste","try out","sample"]],["comprobar",["check","test","prove","ascertain","make sure","substantiate"]],["ensayar",["test","try","rehearse","try out","assay","essay"]],["examinar",["examine","consider","review","look at","explore","test"]],["poner a prueba",["test","try","try out","prove","tempt","put through his paces"]],["experimentar",["experience","experiment","undergo","experiment with","feel","test"]],["someter a prueba",["test","try out","touch"]],["interrogar",["question","interrogate","examine","cross-examine","ask","test"]],["hacer investigaciones",["test"]],["justificar",["justify","warrant","substantiate","prove","make good","test"]],["graduar",["graduate","grade","calibrate","time","test"]]]]],"en",,[["prueba",[5],1,0,1000,0,1,0]],[["test",4,,,""],["test",5,[["prueba",1000,1,0],["prueba de",0,1,0],["ensayo",0,1,0],["de prueba",0,1,0],["test",0,1,0]],[[0,4]],"test"]],,,[["en"]],5]
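One thing to watch for: the response above is not strict JSON. It has empty slots such as `"en",,[[` and `4,,,""`, which a standard JSON parser will likely reject. Here is a rough Python 3 sketch of one way to cope, by filling the empty slots with `null` before parsing. The helper names and the shortened `SAMPLE` string are my own, and the layout (part-of-speech entries at index 1) is only inferred from the output above, so treat it as an assumption rather than a documented format.

```python
import json
import re

# Matches one double-quoted string literal, including escaped characters.
STRING_LITERAL = r'("(?:[^"\\]|\\.)*")'


def parse_loose_json(raw):
    """Parse array text that contains empty slots such as `,,` or `[,`."""
    # Splitting on a captured group keeps the string literals at odd
    # indexes, so commas inside quoted text are never rewritten.
    parts = re.split(STRING_LITERAL, raw)
    for i in range(0, len(parts), 2):
        # Insert `null` wherever a comma directly follows `[` or `,`.
        parts[i] = re.sub(r'(?<=[\[,])(?=,)', 'null', parts[i])
    return json.loads(''.join(parts))


def translations_by_part_of_speech(data):
    """Map each part of speech to its list of translations.

    Assumes index 1 holds entries shaped like ["noun", [...], ...],
    as in the curl output above.
    """
    return {entry[0]: entry[1] for entry in (data[1] or [])}


# Shortened, hand-trimmed version of the curl output, for illustration only.
SAMPLE = ('[[["prueba","test","",""]],'
          '[["noun",["prueba","ensayo"]],["adjective",["de prueba"]]],'
          '"en",,[["prueba",[5],1,0,1000,0,1,0]],,,[["en"]],5]')

data = parse_loose_json(SAMPLE)
by_pos = translations_by_part_of_speech(data)
print(by_pos)
```

The split on string literals is there so that a translation containing a comma is left alone; a plain `replace(',,', ',null,')` would also miss runs of three or more commas.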
Now, in your script, you may need to request that translate_a/t URL instead of the main page.
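For what it's worth, here is a minimal sketch of imitating that call from Python, using the same URL and query parameters as the curl command above. It is written for Python 3 (`urllib.request`); under Python 2 the equivalent calls live in `urllib` and `urllib2`. The function names and the `Mozilla/5.0` user agent are placeholders I picked, and the endpoint itself is undocumented, so it may behave differently or stop working.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl command above; undocumented, may change.
ENDPOINT = "http://translate.google.com/translate_a/t"


def build_request(text, source="en", target="es"):
    """Build a GET request with the same parameters the curl call used."""
    params = [
        ("client", "t"),
        ("text", text),
        ("hl", "en"),
        ("sl", source),
        ("tl", target),
        ("multires", "1"),
        ("ssel", "0"),
        ("tsel", "0"),
        ("sc", "1"),
    ]
    url = ENDPOINT + "?" + urlencode(params)
    # A User-Agent is sent because the question reports a 403 without one.
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_raw(text, source="en", target="es", timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(build_request(text, source, target), timeout=timeout) as resp:
        # Use the charset the server declares, falling back to UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


request = build_request("test")
print(request.full_url)
```

Nothing is sent until `fetch_raw("test")` is called; `build_request` only assembles the URL and headers, which makes it easy to compare against what the browser's inspector shows.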
PS: This might be a TOS issue, as you said, if you are planning to use this script. It's always better to rely on an official API; the HTML you are relying on can change at any time.
Upvotes: 1