Reputation: 13244
I'm working on Windows with Python 2.6.1.
I have a Unicode UTF-16 text file containing the single string Hello, if I look at it in a binary editor I see:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00
BOM H e l l o CR LF
What I want to do is read in this file, run it through Google Translate API, and write both it and the result to a new Unicode UTF-16 text file.
I wrote the following Python script (actually I wrote something more complex than this with more error checking, but this is stripped down as a minimal test case):
#!/usr/bin/python
import urllib
import urllib2
import sys
import codecs
def translate(key, line, lang):
ret = ""
print "translating " + line.strip() + " into " + lang
url = "https://www.googleapis.com/language/translate/v2?key=" + key + "&source=en&target=" + lang + "&q=" + urllib.quote(line.strip())
f = urllib2.urlopen(url)
for l in f.readlines():
if l.find("translatedText") > 0 and l.find('""') == -1:
a,b = l.split(":")
ret = unicode(b.strip('"'), encoding='utf-16', errors='ignore')
break
return ret
rd_file_name = sys.argv[1]
rd_file = codecs.open(rd_file_name, encoding='utf-16', mode="r")
rd_file_new = codecs.open(rd_file_name+".new", encoding='utf-16', mode="w")
key_file = open("api.key","r")
key = key_file.readline().strip()
for line in rd_file.readlines():
new_line = translate(key, line, "ja")
rd_file_new.write(unicode(line) + "\n")
rd_file_new.write(new_line)
rd_file_new.write("\n")
This gives me an almost-Unicode file with some extra bytes in it:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 0A 00
20 22 E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF 22 0A 00
I can see that 20 is a space, 22 is a quote, I assume that "E3" is an escape character that urllib2 is using to indicate that the next character is UTF-16 encoded??
If I run the same script but with "cs" (Czech) instead of "ja" (Japanese) as the target language, the response is all ASCII and I get the Unicode file with my "Hello" first as UTF-16 chars and then "Ahoj" as single byte ASCII chars.
I'm sure I'm missing something obvious but I can't see what. I tried urllib.unquote() on the result from the query but that didn't help. I also tried printing the string as it comes back in f.readlines() and it all looks pretty plausible, but it's hard to tell because my terminal window doesn't support Unicode properly.
Any other suggestions for things to try? I've looked at the suggested dupes but none of them seem to quite match my scenario.
Upvotes: 1
Views: 3527
Reputation: 82934
Those E3 bytes are not "escape characters". If one had no access to documentation, and was forced to make a guess, the most likely suspect for the response encoding would be UTF-8. Expectation (based on a one-week holiday in Japan): something like "konnichiwa".
>>> response = "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF"
>>> ucode = response.decode('utf8')
>>> print repr(ucode)
u'\u3053\u3093\u306b\u3061\u306f'
>>> import unicodedata
>>> for c in ucode:
... print unicodedata.name(c)
...
HIRAGANA LETTER KO
HIRAGANA LETTER N
HIRAGANA LETTER NI
HIRAGANA LETTER TI
HIRAGANA LETTER HA
>>>
Looks close enough to me ...
Upvotes: 2
Reputation: 308176
I believe the output from Google is UTF-8, not UTF-16. Try this fix:
ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore')
Upvotes: 5