Encoding issue when writing to text file, with Python

Question

I'm writing a short Python script to 'manually' rearrange a csv file into proper JSON syntax. I read the input file with readlines() to get a list of rows, which I manipulate and concatenate into a single string, which is then written to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and it is double-spaced horizontally (a whitespace character appears between every pair of characters). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what exactly. When I check the encoding of the input and output files (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.

While there are a number of questions out there on this topic, I didn't find a direct answer to my problem. Detecting the system defaults won't help me in this case, because I need the program to be portable.

Here's the code:

def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string
file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list 
for i in range(0,len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()

Thomas Fenzl · Accepted Answer

All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
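Before reaching for a statistical guesser, it can be worth checking whether the file starts with a byte order mark (BOM), since that identifies the encoding outright. This is a minimal sketch using only the standard library's codecs constants, not chardet; the function name guess_encoding_from_bom is made up for illustration, and a file without a BOM still has to be guessed some other way.

```python
import codecs

# Checked in order. UTF-32 must come before UTF-16, because the
# UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw_bytes):
    """Return a codec name if raw_bytes starts with a known BOM, else None.

    Hypothetical helper for illustration; it does not inspect the
    content beyond the first few bytes.
    """
    for bom, codec_name in _BOMS:
        if raw_bytes.startswith(bom):
            return codec_name
    return None


# The "utf-16" codec writes a BOM when encoding, so this is detectable.
sample = u"\u05e9\u05dc\u05d5\u05dd".encode("utf-16")
detected = guess_encoding_from_bom(sample)
```

The returned names ("utf-16", "utf-32", "utf-8-sig") are codecs that consume the BOM while decoding, so the result can be passed straight to bytes.decode().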

I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.

There's an explanation of the use of "unicode" as an encoding name on Windows in the question "What's the difference between Unicode and UTF-8?".

TLDR: Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode" because they also use it internally.
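To see why UTF-16 data that is never decoded looks "double-spaced", here is a small sketch. The sample string is an assumption for illustration, and latin-1 merely stands in for whatever single-byte default the system would apply.

```python
from __future__ import print_function

# Sample text (made up for illustration): "ab " plus a Hebrew word.
text = u"ab \u05e9\u05dc\u05d5\u05dd"

# Little-endian UTF-16 without a BOM: two bytes per character here.
utf16_bytes = text.encode("utf-16-le")

# Every ASCII character is followed by a zero byte.
ascii_prefix = utf16_bytes[:6]

# Reading the same bytes with a single-byte codec yields one character
# per byte, so the text doubles in length and the Hebrew turns to junk.
misread = utf16_bytes.decode("latin-1")

print(repr(ascii_prefix))
print(len(text), len(misread))
```

The zero bytes are what show up as the extra "whitespace" between characters, and the Hebrew letters come out as two unrelated characters each.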

Even though Python 2 is a bit lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.

In your case

filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
# decode on input; the "UTF-16" codec uses the BOM, if present, to pick the byte order
decoded_data = encoded_data.decode("UTF-16")

# do stuff, resulting in result (all on unicode strings)
lines = [line.split(u"\t") for line in decoded_data.splitlines()]
result = txt_to_JSON(lines)

# encode on output; really, just using UTF-8 for everything makes things a lot easier
encoded_result = result.encode("UTF-16")
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
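
An alternative sketch of the same decode-on-input, encode-on-output pattern lets io.open do the conversion, and writes UTF-8 as suggested in the comment above. io.open exists in both Python 2.7 and Python 3; the function name reencode_file and its default encodings are assumptions for illustration, and the real transformation would go where the comment indicates.

```python
import io


def reencode_file(src_path, dst_path,
                  src_encoding="utf-16", dst_encoding="utf-8"):
    """Read src_path as text, write it back out in dst_encoding.

    Hypothetical helper: the defaults assume a UTF-16 input file,
    which is only a guess about the asker's data.
    """
    # io.open decodes while reading, so `text` is a unicode string.
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()

    # ... transform `text` here (e.g. build the JSON string) ...

    # io.open encodes while writing, so no explicit .encode() is needed.
    with io.open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(text)
    return text
```

Because the encodings are passed explicitly, the script no longer depends on the system default, which is what portability needs here.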
