Converting accented/special characters from plain-text file to LaTeX representation using Python

Question

I have to read through a plain-text (UTF-8) file line-by-line and convert it into a .tex file (just another plain-text file with markup) for processing by a LaTeX processor.

One of the things I want to do is to convert special characters like é into their LaTeX representation: \'e

So I wrote:

with open(input, "r") as in_file, open(output, "w") as out_file:
    for line in in_file:
        # Other code here
        line.replace('é', "\'e") # This fails as below
        # Other code here
        out_file.write(line)

Running the script on an input file gives:

    line.replace('é', "\'e")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

So clearly the interpreter is using the ASCII codec. Why?

Instead of the normal open(...) I also tried codecs.open(input, "r", "utf-8"), and similarly for the output file, but I get the same error.

Before running line.replace(...) I also tried each of the following lines in turn (one at a time, not both together) to convert line to a unicode string:

line = unicode(line, "utf-8")
line = line.decode("utf-8")

but I get exactly the same error.
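
The error message can be reproduced in isolation. The sketch below rests on an assumption the question does not confirm: that the failure comes from the byte-string literal 'é' being implicitly decoded as ASCII when it is combined with a line that is already a unicode string. The explicit decode here stands in for that implicit step, and try_ascii_decode is a hypothetical helper name used only for illustration.

```python
# -*- coding: utf-8 -*-
# The UTF-8 encoding of 'é' is the two bytes 0xC3 0xA9.
e_acute_bytes = b"\xc3\xa9"


def try_ascii_decode(raw):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


# Assumption: this mirrors what happens implicitly when a byte-string
# literal meets a unicode string in Python 2.
message = try_ascii_decode(e_acute_bytes)
print(message)
```

The reported byte (0xc3) and position (0) match the first byte of the literal itself, which would be consistent with the literal, rather than the file contents, being the thing that fails to decode.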

What's the proper way to do it?

Update 1: I had already added # -*- coding: UTF-8 -*- as the second line of the .py file before asking this question. Without it, the interpreter gives the following error when I try to run the script:

SyntaxError: Non-ASCII character '\xc3' in file  on line 46, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
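
For illustration, here is a minimal sketch of one way the conversion could be written so that everything stays in unicode from read to write. It is an assumption-laden example, not a confirmed fix for the script above: LATEX_REPLACEMENTS, to_latex, and convert_file are hypothetical names, and the mapping contains only the single character mentioned in the question. It uses io.open with an explicit encoding, unicode literals on both sides of replace, and a doubled backslash, since "\'e" in Python source is just 'e with the backslash consumed.

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical mapping for illustration; extend it as needed.
LATEX_REPLACEMENTS = {
    u"\u00e9": u"\\'e",  # é -> \'e (backslash doubled so it survives)
}


def to_latex(text):
    """Replace each mapped character in a unicode string with its LaTeX form."""
    for char, latex in LATEX_REPLACEMENTS.items():
        # replace() returns a new string, so the result must be reassigned.
        text = text.replace(char, latex)
    return text


def convert_file(input_path, output_path):
    """Read UTF-8 text line by line and write the converted lines as UTF-8."""
    with io.open(input_path, "r", encoding="utf-8") as in_file:
        with io.open(output_path, "w", encoding="utf-8") as out_file:
            for line in in_file:
                out_file.write(to_latex(line))
```

Calling convert_file("notes.txt", "notes.tex") (both file names are placeholders) would write a copy of the input in which every é has been replaced by \'e.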
