joe martin

Reputation: 93

Having trouble splitting a string

I'm scraping some data from Google Translate like so:

import urllib
import mechanize

get_url=("https://translate.google.ie/translate_a/single?client=t&sl=auto&tl=es&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&source=btn&ssel=0&tsel=3&kc=0&tk=520887|911740&q=Hellow%20World")

browser=mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders=[('User-agent','Chrome')]

translate_text=urllib.urlopen(get_url).read()
print translate_text

Which gives me the following output:

[["Hellow Mundial", "Hellow World"]]
undefined
"en"
undefined
undefined
[["Hellow", 1,…], ["World", 2,…]]
0.022165652
undefined
[["en"], undefined, [0.022165652]]

So I try to split the data on ]] so that my output is only:

[["Hellow Mundial", "Hellow World"]]

I'm splitting the data like so:

translate_text=translate_text.split("]]")
print translate_text[0]

However, when I run this I get the page markup, even though printing before the split gave me the query result. Why does the split cause this instead of splitting the string as intended?

Upvotes: 1

Views: 77

Answers (4)

Padraic Cunningham

Reputation: 180502

You could extract the first list with a regex:

get_url=("https://translate.google.ie/translate_a/single?client=t&sl=auto&tl=es&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&source=btn&ssel=0&tsel=3&kc=0&tk=520887|911740&q=Hellow%20World")

import requests
r = requests.get(get_url)

import re

print(re.search(r'\[("(.*?)")\]', r.text).group(1))

"Hello World como estas","Hello World how are you"

If you want the names in variables:

a, b = re.search(r'\[("(.*?)")\]', r.text).group(1).split(",")
print(a,b)
"Hello World como estas" "Hello World how are you"

If you really want a list you can use ast.literal_eval after getting the first list with re:

import re
from ast import literal_eval
print(literal_eval(re.search(r'\[("(.*?)")\]', r.text).group(0)))
['Hello World como estas', 'Hello World how are you']
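
To check the same idea without a network call, here is a minimal sketch that runs the regex against a hard-coded string. The sample value is an assumption modeled on the output shown in the question, not a captured reply from Google, and first_list is a name made up for illustration.

```python
import re
from ast import literal_eval

# Hypothetical sample shaped like the start of the response in the question.
sample = '[[["Hellow Mundial","Hellow World"]],,"en"]'

# Find the first [...] whose contents begin and end with a double quote.
match = re.search(r'\[("(.*?)")\]', sample)

# group(0) keeps the surrounding brackets, so literal_eval can build a list.
first_list = literal_eval(match.group(0))
print(first_list)
```

The lazy (.*?) stops at the first quote that is directly followed by ], so a translation that itself contains "] would cut the match short.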

If you open the URL in your browser, the response actually downloads as a .txt file.

Upvotes: 0

Martin Konecny

Reputation: 59671

Google is returning something similar to JSON (but not actually valid JSON), which can be parsed easily once a simple regex has replaced any run of consecutive commas with a single one:

Try:

import json
import re

# replace any consecutive commas with a single one
translate_text = re.sub( ',+', ',', translate_text ).strip()
arr = json.loads(translate_text)
print arr[0][0][0] # prints "Hellow Mundial"
print arr[0][0][1] # prints "Hellow World"

Note that translate_text is a string and arr is a Python list. json.loads parses the text into native Python objects, so you can use ordinary list and dictionary look-ups on the result.
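
As a self-contained illustration of the comma clean-up, the sketch below uses a hard-coded raw string in place of a live response. Its exact shape is an assumption reconstructed from the output in the question (the empty slots are presumably what the browser view shows as undefined), and the names raw, cleaned and parsed_raw are made up for this example.

```python
import json
import re

# Hypothetical raw response, reconstructed from the question's output.
raw = ('[[["Hellow Mundial","Hellow World"]],,"en",,,'
       '[["Hellow",1],["World",2]],0.022165652,,[["en"],,[0.022165652]]]')

# The empty slots (",,") make the raw text invalid JSON.
try:
    json.loads(raw)
    parsed_raw = True
except ValueError:
    parsed_raw = False

# Collapse each run of commas into a single comma, then parse.
cleaned = re.sub(r',+', ',', raw).strip()
arr = json.loads(cleaned)

print(parsed_raw)
print(arr[0][0][0])
print(arr[0][0][1])
```

Two caveats apply to this approach: removing the empty slots shifts the positions of the later elements, and the substitution also collapses repeated commas that happen to appear inside the translated text itself.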

Upvotes: 3

Zulu

Reputation: 9285

I think the string you want to use is in JSON format, so I suggest parsing it with the json library:

>>> import json
>>> json.loads('[["Hellow Mundial", "Hellow World"]]')
[[u'Hellow Mundial', u'Hellow World']]

The JSON is translated into Python objects (here, a list of lists):

>>> l = json.loads('[["Hellow Mundial", "Hellow World"]]')
>>> l[0]
[u'Hellow Mundial', u'Hellow World']
>>> l[0][0]
u'Hellow Mundial'
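
If the goal is to end up with the two strings in separate variables, the inner list can be unpacked directly. This is only a sketch on the same literal string as above; the names pairs, translated and original are made up for illustration.

```python
import json

# The first line of the output from the question, used as a literal string.
pairs = json.loads('[["Hellow Mundial", "Hellow World"]]')

# pairs[0] is the inner two-element list; unpack it into two names.
translated, original = pairs[0]
print(translated)
print(original)
```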

Upvotes: 0

Luke

Reputation: 5708

Those ]] you see are not part of the actual string. They are placed there by Python to indicate that the items inside the [] and delimited by , are elements of an array.

In your case, the first element of the array is a 2D array whose first dimension only contains one element. That element is itself an array containing two strings.

If I understand your question correctly, you don't need to split anything at all. Try simply typing:

print translate_text[0]

without the split.

Upvotes: 0
