Problem writing unicode UTF-16 data to file in python

Question

I'm working on Windows with Python 2.6.1.

I have a Unicode UTF-16 text file containing the single string Hello. If I look at it in a binary editor, I see:

FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00
BOM   H     e     l     l     o     CR    LF
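
The dump above can be checked from Python. This is only a minimal sketch: the byte string is typed in by hand from the dump, and the variable names are made up for illustration.

```python
import codecs

# Bytes typed in by hand from the hex dump above: BOM, "Hello", CR, LF.
data = b"\xff\xfe" + b"H\x00e\x00l\x00l\x00o\x00\r\x00\n\x00"

# FF FE is the little-endian UTF-16 byte order mark.
has_le_bom = data.startswith(codecs.BOM_UTF16_LE)

# The generic "utf-16" codec consumes the BOM and uses it to pick the byte order.
text = data.decode("utf-16")

print(has_le_bom)
print(repr(text))
```

The decoded string holds the five letters of Hello followed by CR and LF; the BOM itself is not part of the result.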

What I want to do is read in this file, run it through Google Translate API, and write both it and the result to a new Unicode UTF-16 text file.

I wrote the following Python script (actually I wrote something more complex than this with more error checking, but this is stripped down as a minimal test case):

#!/usr/bin/python    
import urllib
import urllib2
import sys
import codecs

def translate(key, line, lang):
    ret = ""
    print "translating " + line.strip() + " into " + lang
    url = "https://www.googleapis.com/language/translate/v2?key=" + key + "&source=en&target=" + lang + "&q=" + urllib.quote(line.strip())
    f = urllib2.urlopen(url)
    for l in f.readlines():
        if l.find("translatedText") > 0 and l.find('""') == -1:
            a,b = l.split(":")
            ret = unicode(b.strip('"'), encoding='utf-16', errors='ignore')
            break
    return ret

rd_file_name = sys.argv[1]
rd_file = codecs.open(rd_file_name, encoding='utf-16', mode="r")
rd_file_new = codecs.open(rd_file_name+".new", encoding='utf-16', mode="w")
key_file = open("api.key","r")

key = key_file.readline().strip()

for line in rd_file.readlines():
    new_line = translate(key, line, "ja")
    rd_file_new.write(unicode(line) + "\n")
    rd_file_new.write(new_line)
    rd_file_new.write("\n")

This gives me an almost-Unicode file with some extra bytes in it:

FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 0A 00
20 22 E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF 22 0A 00

I can see that 20 is a space and 22 is a quote. I assume that "E3" is an escape character that urllib2 is using to indicate that the next character is UTF-16 encoded?

If I run the same script but with "cs" (Czech) instead of "ja" (Japanese) as the target language, the response is all ASCII and I get the Unicode file with my "Hello" first as UTF-16 chars and then "Ahoj" as single byte ASCII chars.

I'm sure I'm missing something obvious but I can't see what. I tried urllib.unquote() on the result from the query but that didn't help. I also tried printing the string as it comes back in f.readlines() and it all looks pretty plausible, but it's hard to tell because my terminal window doesn't support Unicode properly.

Any other suggestions for things to try? I've looked at the suggested dupes but none of them seem to quite match my scenario.

Mark Ransom · Accepted Answer

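I believe the output from Google is UTF-8, not UTF-16: each E3 xx xx group in your dump is a three-byte UTF-8 sequence for one Japanese character, not an escape code. Try this fix: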

ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore')
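
To see why this should help, here is a minimal sketch that takes the payload bytes from the hex dump in the question, decodes them as UTF-8, and writes the result through a UTF-16 codec. The byte strings are copied by hand from the dump, and the temporary file path and variable names are assumptions made for illustration. It is written to run on Python 2.6+ as well as Python 3.

```python
import codecs
import os
import tempfile

# Payload bytes copied from the hex dump in the question
# (leading space and surrounding quotes left out).
raw_ja = b"\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf"
raw_cs = b"Ahoj"  # the Czech reply is pure ASCII, which is also valid UTF-8

# Decode the HTTP payload as UTF-8 to get real Unicode strings.
text_ja = raw_ja.decode("utf-8")  # U+3053 U+3093 U+306B U+3061 U+306F
text_cs = raw_cs.decode("utf-8")

# Hypothetical output path, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "hello.new")

# The utf-16 codec writes a BOM first, then two bytes per character here.
out = codecs.open(path, encoding="utf-16", mode="w")
out.write(text_cs + u"\n")
out.write(text_ja + u"\n")
out.close()

# Read the raw bytes back to inspect what landed on disk.
f = open(path, "rb")
written = f.read()
f.close()
```

The 15 payload bytes decode to the five characters こんにちは, and `written` then contains a BOM followed by two bytes for every character, including the ASCII ones. That also fits the Czech case: "Ahoj" is the same byte sequence in ASCII and UTF-8, so it only looks like plain ASCII until it is decoded and re-encoded.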
