commentator8

Reputation: 627

urllib2, mechanize not returning same as browser - what else to spoof?

I am trying to create a script (purely for learning purposes) that translates a given word using a few different dictionaries. I worked my way through two of them, using urllib2 and BeautifulSoup to fetch and parse the translations, then moved on to Google Translate.

I quickly found that it returns a 403 Forbidden error. Adding a user agent gets the translation, but only a one-word translation. To illustrate, go to http://translate.google.com/?text=test&sl=en&tl=es and you will get both the translation (in an element with class 'hps') and a list of verbs, nouns and adjectives. With the script below, however, the HTML is different: only the main translation is returned, in a span with id=result_box, and the verbs, nouns, etc. are nowhere to be found.

During the process, and after quite a bit of googling, I became aware that there is now an API, and not a free one at that. I do not intend to post any eventual script, nor to use it to violate any TOS, but I am now mostly intrigued as to why the browser and urllib2 get different results.

To that end I have tried pure urllib2 with user agents, and mechanize, as below. So my question is: aside from the user agent, what else differentiates the browser from the Python script? I have tried using Firebug, but nothing jumped out at me (admittedly, I'm a noob with it). Thanks!

Edit: The request headers from Firebug and from my script are below.

import mechanize
import re
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://translate.google.com/?text=test&sl=en&tl=es')
html = r.read()
match = re.findall(r'verb', html)

print match

Firebug:

GET /?text=test&sl=en&tl=es HTTP/1.1

Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset  ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding gzip, deflate
Accept-Language en-us,en;q=0.5
Connection  keep-alive
Cookie  PREF=ID=298b435815ef8553:U=e7dad4baf65f083b:FF=0:LD=en:CR=2:TM=1327516863:LM=1339428154:S=maktYFZEHXXpMDFg; NID=60=U229h4lzOnjpHyidbhgYecCx72Myp_-XHgupW-R_mWtpuOveDdIOO1uLBq-6ltn-ER15ppJryR7yYOYEhkCfUCl45qNz5aymBQ1CGDHS4UcHu2oIDYAHut0ctnlL76eDW3n7kjOWoz5wNH6NMw
Host    translate.google.com
User-Agent  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20100101 Firefox/9.0

Script:

send: 'GET /?text=test&sl=en&tl=es HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: translate.google.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1\r\n\r\n'

reply: 'HTTP/1.1 200 OK\r\n'

header: Date: Mon, 11 Jun 2012 16:13:42 GMT

header: Expires: Fri, 01 Jan 1990 00:00:00 GMT

header: Cache-Control: no-cache, must-revalidate

header: Pragma: no-cache

header: X-Frame-Options: SAMEORIGIN

header: Content-Type: text/html; charset=UTF-8

header: Content-Language: en

header: Set-Cookie: PREF=ID=6dd42f2264250d7c:TM=1333431222:LM=1339454222:S=k6JXSoGGaAMNmPEo; expires=Wed, 11-Jun-2014 16:13:42 GMT; path=/; domain=.google.com

header: Set-Cookie: NID=60=f8czmR413h3sKUGJUUM4PLKl2O7SUtqfW5hss5O54sRKoErf9wIEU4Wu2WCuHzWTJQ3p1Rj7dQv1B4BBmSMY1tmfus7UZGCYFIKaXoKwklZ9tZsr5vds8vvvFjRdZyevn; expires=Tue, 11-Dec-2012 16:13:42 GMT; path=/; domain=.google.com; HttpOnly

header: P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."

header: X-Content-Type-Options: nosniff

header: Server: HTTP server (unknown)

header: X-XSS-Protection: 1; mode=block

header: Connection: close
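
To make the difference between the two requests explicit, here is a small sketch that compares the header sets listed above. The helper name diff_headers is made up for illustration, and the cookie values are elided. It only shows which request headers differ; it does not by itself explain the difference in the returned HTML.

```python
# Request headers as shown by Firebug (cookie values elided).
browser_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-us,en;q=0.5",
    "Connection": "keep-alive",
    "Cookie": "PREF=...; NID=...",
    "Host": "translate.google.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20100101 Firefox/9.0",
}

# Request headers sent by the mechanize script (from its debug output).
script_headers = {
    "Accept-Encoding": "identity",
    "Host": "translate.google.com",
    "Connection": "close",
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) "
                  "Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1",
}


def diff_headers(reference, other):
    """Return (missing, changed): header names present in `reference` but
    absent from `other`, and names present in both with different values.
    Header names are compared case-insensitively."""
    ref = {name.lower(): value for name, value in reference.items()}
    oth = {name.lower(): value for name, value in other.items()}
    missing = sorted(set(ref) - set(oth))
    changed = sorted(name for name in set(ref) & set(oth) if ref[name] != oth[name])
    return missing, changed


missing, changed = diff_headers(browser_headers, script_headers)
print("missing from script:", missing)
print("different values:", changed)
```

The missing names could be added to br.addheaders one at a time to check whether any of them changes the response.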

Upvotes: 1

Views: 632

Answers (1)

UltraInstinct

Reputation: 44444

The verbs, adjectives, etc. are not being found because they are loaded via an AJAX call. Your mechanize browser does not run JavaScript, so it cannot make any AJAX requests. However, if you look in your browser's inspector (Firebug's Net panel, for example), you will see the URL, headers and parameters of that call. All that remains is to imitate the call.

I curl'ed it, and I got a JSON response:

thrustmaster@thrustmaster:~/Temp$ curl 'http://translate.google.com/translate_a/t?client=t&text=test&hl=en&sl=en&tl=es&multires=1&ssel=0&tsel=0&sc=1' -H 'User-Agent: blah'
[[["prueba","test","",""]],[["noun",["prueba","ensayo","test","examen","análisis","criterio","toque","ejercicio","tanteo"],[["prueba",["test","proof","evidence","trial","event","race"]],["ensayo",["test","trial","essay","assay","testing","rehearsal"]],["test",["test"]],["examen",["examination","review","exam","test","inspection","quiz"]],["análisis",["analysis","test","review","assay","breakdown"]],["criterio",["criterion","judgment","standard","test","view","yardstick"]],["toque",["touch","stroke","test","knock","blast","chime"]],["ejercicio",["exercise","practice","drill","practicing","test","prosecution"]],["tanteo",["score","scoring","trial","test","try","calculation"]]]],["adjective",["de prueba"],[["de prueba",["test","testing","trial","probationary","corrective"]]]],["verb",["probar","comprobar","ensayar","examinar","poner a prueba","experimentar","someter a prueba","interrogar","hacer investigaciones","justificar","graduar"],[["probar",["test","try","prove","taste","try out","sample"]],["comprobar",["check","test","prove","ascertain","make sure","substantiate"]],["ensayar",["test","try","rehearse","try out","assay","essay"]],["examinar",["examine","consider","review","look at","explore","test"]],["poner a prueba",["test","try","try out","prove","tempt","put through his paces"]],["experimentar",["experience","experiment","undergo","experiment with","feel","test"]],["someter a prueba",["test","try out","touch"]],["interrogar",["question","interrogate","examine","cross-examine","ask","test"]],["hacer investigaciones",["test"]],["justificar",["justify","warrant","substantiate","prove","make good","test"]],["graduar",["graduate","grade","calibrate","time","test"]]]]],"en",,[["prueba",[5],1,0,1000,0,1,0]],[["test",4,,,""],["test",5,[["prueba",1000,1,0],["prueba de",0,1,0],["ensayo",0,1,0],["de prueba",0,1,0],["test",0,1,0]],[[0,4]],"test"]],,,[["en"]],5]

So in your script, you may want to fetch the response from the URL below instead of the HTML page:

http://translate.google.com/translate_a/t?client=t&text=test&hl=en&sl=en&tl=es&multires=1&ssel=0&tsel=0&sc=1
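
A rough sketch of how that could look in Python, under a few assumptions. It is written for Python 3 (under Python 2 the equivalents are urllib2.Request and urllib2.urlopen). The function names are made up for illustration. The response is not strict JSON, since it contains empty slots such as `"en",,[` that json.loads rejects, so the sketch fills those with null first. The positions used in summarize (main translation at [0][0][0], part-of-speech entries at [1]) are only read off the output shown above; the endpoint is undocumented and may change or be blocked at any time.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint and parameters copied from the curl call above.
ENDPOINT = "http://translate.google.com/translate_a/t"


def build_url(text, source="en", target="es"):
    """Build the same query string as the curl call, for arbitrary text."""
    params = [("client", "t"), ("text", text), ("hl", "en"), ("sl", source),
              ("tl", target), ("multires", "1"), ("ssel", "0"), ("tsel", "0"),
              ("sc", "1")]
    return ENDPOINT + "?" + urlencode(params)


def fetch_raw(url):
    """Fetch the raw response body (performs a network request)."""
    request = Request(url, headers={"User-Agent": "blah"})
    with urlopen(request) as response:
        # Assumption: fall back to UTF-8 if the server names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fill_empty_slots(raw):
    """Insert null into empty array slots (e.g. [1,,2]), skipping strings."""
    out = []
    in_string = False
    escaped = False
    prev = ""  # last non-whitespace character seen outside a string
    for ch in raw:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                prev = '"'
            continue
        if ch == '"':
            in_string = True
        elif (ch == "," and prev in ("[", ",")) or (ch == "]" and prev == ","):
            out.append("null")
        out.append(ch)
        if not ch.isspace():
            prev = ch
    return "".join(out)


def parse_response(raw):
    """Turn the JSON-like body into nested Python lists."""
    return json.loads(fill_empty_slots(raw))


def summarize(data):
    """Return (main translation, {part of speech: [translations]})."""
    main = data[0][0][0]
    by_part_of_speech = {entry[0]: entry[1] for entry in (data[1] or [])}
    return main, by_part_of_speech
```

With network access, `summarize(parse_response(fetch_raw(build_url("test"))))` would chain the steps; the parsing functions can also be tried offline on the curl output pasted above.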

PS:

This might be a TOS issue, as you said, if you are planning to actually use this script. It's always a better choice to rely on an official API: the HTML (or the undocumented response format) you depend on can change at any time.

Upvotes: 1
