Reading "raw" Unicode-strings in Python

I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I have not found an answer.

I have a mixed source document which contains HTML, XML, LaTeX and other text formats, and which I am trying to convert into a LaTeX-only format.

To do this, I have used Python to recognise the different commands with regular expressions and replace them with the appropriate LaTeX commands. Everything has worked fine so far.

Now I am left with some "raw-type" Unicode characters, such as Greek letters. Unfortunately, there are just too many to replace by hand, so I am looking for a smart way to handle these too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. Pi written as a Greek letter?

A minimal example of the code I use is:

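import re

# Read the whole source document into memory.
fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

# Replace every match of the pattern with its LaTeX equivalent.
new_stuff = re.sub('READ', 'REPLACE', stuff)

# Write the converted text to the LaTeX document.
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()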

I am not sure whether this is important, but I am using Python 2.6 on Windows.

I would be really glad if someone could give me a hint, at least about where to find the relevant information or how this might work. Or perhaps I am completely wrong, and Python can't do this job ...

Many thanks in advance.
Cheers,
Britta

Upvotes: 2

Answers (3)

bendin

Reputation: 9574

Please, first, read this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then, come back and ask questions.

Upvotes: 1

Aaron Digulla

Reputation: 328754

You need to determine the "encoding" of the input document. Unicode defines over a million code points, but files only store bytes (values 0-255). So the Unicode text must be encoded as a sequence of bytes in some way.
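As a minimal sketch of what "encoded in some way" means, the snippet below encodes the Greek letter pi (U+03C0) with two common encodings and checks whether a legacy 8-bit encoding can represent it. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
pi = u"\u03c0"  # GREEK SMALL LETTER PI as a Unicode string

# The same character becomes different byte sequences in different encodings.
utf8_bytes = pi.encode("utf-8")       # two bytes: 0xCF 0x80
utf16_bytes = pi.encode("utf-16-le")  # two bytes: 0xC0 0x03

# Latin-1 has no slot for Greek letters, so encoding fails.
try:
    pi.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding with the matching encoding recovers the original character.
assert utf8_bytes.decode("utf-8") == pi
```

Reading a file works the other way round: the bytes on disk only turn back into the right characters if they are decoded with the encoding that was used to write them.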

If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".

If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
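The trial-and-error step can also be done from Python instead of an editor. The following is only a sketch: `preview_encodings` is a hypothetical helper, and the candidate list is an assumption that should be adapted to the encodings you suspect.

```python
# -*- coding: utf-8 -*-
def preview_encodings(raw_bytes, candidates):
    """Decode raw_bytes with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text,
    or to None if the bytes are not valid in that encoding.
    """
    previews = {}
    for name in candidates:
        try:
            previews[name] = raw_bytes.decode(name)
        except UnicodeDecodeError:
            previews[name] = None
    return previews


# Stand-in for the bytes of a real file: "pi = 3.14" stored as UTF-8.
sample = u"\u03c0 = 3.14".encode("utf-8")
previews = preview_encodings(sample, ["ascii", "utf-8", "latin-1"])
```

For a real document, the bytes would come from `open(path, 'rb').read()`. Note that a decode which succeeds is not necessarily the right one: Latin-1 accepts any byte sequence, so you still have to check which result "looks right".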

Upvotes: 0

Stephan202

Reputation: 61569

You talk of ``raw'' Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).

The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
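Once the file is decoded into Unicode text, the Greek letters can be replaced like any other pattern. The sketch below makes several assumptions: it uses `io.open` (available from Python 2.6 and equivalent to the built-in `open` in Python 3) rather than `codecs.open`, it assumes the file is UTF-8, and the mapping table and function names are hypothetical and cover only a few letters. The replacements are wrapped in `\ensuremath{...}` so that they work both inside and outside math mode.

```python
# -*- coding: utf-8 -*-
import io
import re

# Hypothetical, incomplete mapping from Greek letters to LaTeX commands.
GREEK_TO_LATEX = {
    u"\u03b1": u"\\ensuremath{\\alpha}",
    u"\u03b2": u"\\ensuremath{\\beta}",
    u"\u03c0": u"\\ensuremath{\\pi}",
    u"\u03a9": u"\\ensuremath{\\Omega}",
}

# One character class that matches any letter in the mapping.
GREEK_PATTERN = re.compile(u"[" + u"".join(GREEK_TO_LATEX) + u"]")


def greek_to_latex(text):
    """Replace every mapped Greek letter in a Unicode string."""
    return GREEK_PATTERN.sub(lambda m: GREEK_TO_LATEX[m.group(0)], text)


def convert_file(source_path, target_path, encoding="utf-8"):
    """Read source_path, replace Greek letters, write target_path."""
    with io.open(source_path, "r", encoding=encoding) as fh:
        text = fh.read()
    with io.open(target_path, "w", encoding=encoding) as fh:
        fh.write(greek_to_latex(text))
```

With this in place, `convert_file('SOURCE_DOCUMENT', 'LATEX_DOCUMENT')` would play the role of the read / `re.sub` / write steps in the question, provided the encoding argument matches the source file.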

Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)

Upvotes: 4
