Reputation: 93
I'm scraping some data from Google Translate like so:
import urllib
import mechanize
get_url=("https://translate.google.ie/translate_a/single?client=t&sl=auto&tl=es&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&source=btn&ssel=0&tsel=3&kc=0&tk=520887|911740&q=Hellow%20World")
browser=mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders=[('User-agent','Chrome')]
translate_text=urllib.urlopen(get_url).read()
print translate_text
Which gives me the following output:
[["Hellow Mundial", "Hellow World"]]
undefined
"en"
undefined
undefined
[["Hellow", 1,…], ["World", 2,…]]
0.022165652
undefined
[["en"], undefined, [0.022165652]]
So I try to split the data on the ]] so my output will only be:
[["Hellow Mundial", "Hellow World"]]
I'm splitting the data like so:
translate_text=translate_text.split("]]")
print translate_text[0]
However, when I run this I get the page markup. Before the split, I got the query result. How come the split is causing this and not splitting the string as intended?
Upvotes: 1
Views: 77
Reputation: 180502
You could extract the first list with a regex:
import re
import requests
get_url=("https://translate.google.ie/translate_a/single?client=t&sl=auto&tl=es&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&source=btn&ssel=0&tsel=3&kc=0&tk=520887|911740&q=Hellow%20World")
r = requests.get(get_url)
# r.text is a str; r.content is bytes on Python 3 and would not match a str pattern
print(re.search(r"\[(\"(.*?)\")\]",r.text).group(1))
"Hello World como estas","Hello World how are you"
If you want the names in variables:
# splitting on "," assumes neither string contains a comma
a, b = re.search(r"\[(\"(.*?)\")\]",r.text).group(1).split(",")
print(a,b)
"Hello World como estas" "Hello World how are you"
If you really want a list you can use ast.literal_eval after getting the first list with re:
import re
from ast import literal_eval
# group(0) includes the surrounding brackets, so it evaluates to a list
print(literal_eval(re.search(r"\[(\"(.*?)\")\]",r.text).group(0)))
['Hello World como estas', 'Hello World how are you']
If you open the URL in your browser, the response actually downloads as a .txt file.
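The regex and literal_eval steps above can be tried without a network call. This is only a sketch: `sample` is a hand-written, shortened string shaped like the output in the question, not a real response, and the `None` check is an addition for the case where the pattern does not match.

```python
import re
from ast import literal_eval

# Hypothetical, shortened stand-in for the response body
sample = '[[["Hellow Mundial","Hellow World"]],,"en"]'

# Same pattern as above: the first [...] that starts and ends with a quoted string
match = re.search(r'\[("(.*?)")\]', sample)
if match is None:
    raise ValueError("no quoted list found in the response")

# group(0) is the bracketed text, which is a valid Python list literal
first_list = literal_eval(match.group(0))
print(first_list)
```

Checking for `None` before calling `.group()` avoids an AttributeError if the service ever returns something in a different shape.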
Upvotes: 0
Reputation: 59671
Google is returning something similar to JSON (but not actually JSON), which can be parsed very easily after a simple regex replaces any run of consecutive commas with a single one:
Try:
import json
import re
# replace any consecutive commas with a single one
translate_text = re.sub( ',+', ',', translate_text ).strip()
arr = json.loads(translate_text)
print arr[0][0][0] # prints "Hellow Mundial"
print arr[0][0][1] # prints "Hellow World"
Note that translate_text is a string, and arr is a Python list. json.loads parses the string into native Python objects, so you can use ordinary list and dictionary look-ups.
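Collapsing commas shifts the position of every later element. A variation, sketched here under assumptions, fills each empty slot with `null` instead so the indices stay where they were. The helper name `parse_loose_json` and the `sample` string are made up for illustration, and the regex assumes no string value in the response contains `,,` or `[,`.

```python
import json
import re

def parse_loose_json(text):
    """Parse JSON-like text in which empty slots appear as ',,' or '[,'.

    Hypothetical helper: assumes string values never contain ',,' or '[,'.
    """
    # Insert null wherever a ',' or '[' is immediately followed by ','
    filled = re.sub(r'(?<=[,\[])(?=,)', 'null', text)
    return json.loads(filled)

# Hand-written sample shaped like the output in the question
sample = ('[[["Hellow Mundial","Hellow World"]],,"en",,,'
          '[["Hellow",1],["World",2]],0.022165652,,[["en"],,[0.022165652]]]')

arr = parse_loose_json(sample)
print(arr[0][0][0])  # translated text
print(arr[0][0][1])  # original text
print(arr[2])        # detected source language
```

With this approach the empty slots come back as `None`, so `arr[1]` is `None` and `arr[2]` is still the language code.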
Upvotes: 3
Reputation: 9285
I think the string you want to use is in JSON format, so I suggest parsing it with the json library:
>>> import json
>>> json.loads('[["Hellow Mundial", "Hellow World"]]')
[[u'Hellow Mundial', u'Hellow World']]
The JSON is converted into Python objects (here, a list of lists):
>>> l = json.loads('[["Hellow Mundial", "Hellow World"]]')
>>> l[0]
[u'Hellow Mundial', u'Hellow World']
>>> l[0][0]
u'Hellow Mundial'
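json.loads only accepts strict JSON, and the full output in the question has empty slots between commas. The sketch below shows how that failure surfaces; `try_parse` is a hypothetical helper and `loose` is a hand-written, shortened sample, not a real response.

```python
import json

def try_parse(text):
    """Return the parsed value, or None if text is not strict JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass on Python 3
        return None

strict = '[["Hellow Mundial", "Hellow World"]]'
loose = '[[["Hellow Mundial","Hellow World"]],,"en"]'  # empty slot between commas

print(try_parse(strict))  # parsed list of lists
print(try_parse(loose))   # None, because ',,' is not valid JSON
```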
Upvotes: 0
Reputation: 5708
Those ]] you see are not a part of the actual string. They are placed there by Python to indicate that the stuff inside the [] and delimited by , are elements of an array.
In your case, the first element of the array is a 2D array whose first dimension only contains one element. That element is itself an array containing two strings.
If I understand your question correctly, you don't need to split anything at all. Try simply typing:
print translate_text[0]
without the split.
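What `translate_text[0]` prints depends on whether the value is still a str or has already been parsed into a list. A quick way to check, using a hand-written sample string in place of the real response:

```python
import json

raw = '[["Hellow Mundial", "Hellow World"]]'  # sample text, still a str

# Indexing a str returns a single character
print(type(raw).__name__, raw[0])

# Indexing the parsed list returns its first element
parsed = json.loads(raw)
print(type(parsed).__name__, parsed[0])
```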
Upvotes: 0