Reputation: 2700

Newline characters in non ASCII encoded files

I'm using Python 2.6 to read a Latin-2 encoded file with Windows line endings ('\r\n').

import codecs

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rt')
line = file.readline()
print(repr(line))

outputs: u'login: yabcok\n'

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='r')
line = file.readline()
print(repr(line))

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rb')
line = file.readline()
print(repr(line))

outputs: u'password: l1x1%Dm\r\n'

My questions:

Why is text mode not the default? The documentation states otherwise. Is the codecs module commonly used with binary files?
Why aren't newline characters stripped from readline() output? This is annoying and redundant.
Is there a way to specify the newline character for files that are not ASCII encoded?

Upvotes: 1

Answers (2)

bobince

Reputation: 536567

mode='rt'

'rt' isn't a real mode as such - that will do the same as 'r'.

Why is text mode not the default?

See Torsten's answer.

Also, if you are using anything but Windows, text mode files behave identically to binary files anyway.

You may instead be thinking of 'U'niversal newlines mode, which attempts to allow other platforms' text-mode files to work. Whilst it is possible to pass a 'U' flag to codecs.open, given the documentation quoted in Torsten's answer I think that's a bug. Certainly the results would go wrong on UTF-16 and some East Asian codecs, so don't rely on it.

Why aren't newline characters stripped from readline() output?

It is necessary to be able to tell whether the last line of the file ends with a trailing newline.
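
A minimal sketch of that point, using an in-memory stream instead of a real file (the sample text is made up for illustration). readline() returns an empty string only at end of file, so keeping the terminator is what lets a blank line be told apart from EOF. When the ending is not wanted, rstrip('\r\n') removes it explicitly.

```python
import io

# Hypothetical sample data: two lines, the last one WITHOUT a trailing newline.
# newline="" keeps the original line endings untouched when reading.
stream = io.StringIO(u"login: yabcok\r\npassword: l1x1%Dm", newline="")

lines = []
while True:
    line = stream.readline()
    if line == u"":          # empty string means end of file, not a blank line
        break
    lines.append(line)

# The kept terminators show that only the first line was terminated.
ends_with_newline = [l.endswith(u"\n") for l in lines]

# Strip the terminator explicitly when it is not needed.
stripped = [l.rstrip(u"\r\n") for l in lines]
```

Here ends_with_newline tells you the file did not end with a newline, which would be lost if readline() stripped the endings itself.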

Upvotes: 0

Torsten Marek

Reputation: 86542

Are you sure that your examples are correct? The documentation of the codecs module says:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

On my system, with a Latin-2 encoded file + DOS line endings, there's no difference between "rt", "r" and "rb" (Disclaimer: I'm using 2.5 on Linux).
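
A small sketch to reproduce that check on any platform. The file name and contents are invented for the example, and the warning filter is only there because newer Python 3 releases may flag codecs.open as deprecated. 'rt' is left out on purpose, since Python 3 rejects a mode that combines 't' with the 'b' that codecs.open appends.

```python
import codecs
import os
import tempfile
import warnings

# Hypothetical sample file: Latin-2 bytes with Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "sample_latin2.txt")
with open(path, "wb") as f:
    f.write(u"has\u0142o: l1x1%Dm\r\n".encode("latin2"))

results = {}
with warnings.catch_warnings():
    # codecs.open may emit a DeprecationWarning on newer Python 3 releases.
    warnings.simplefilter("ignore", DeprecationWarning)
    for mode in ("r", "rb"):
        # codecs.open appends 'b' to the mode when an encoding is given,
        # so both modes open the underlying file in binary.
        reader = codecs.open(path, encoding="latin2", mode=mode)
        try:
            results[mode] = reader.readline()
        finally:
            reader.close()
```

Both entries in results should hold the same decoded line with '\r\n' intact, matching the note in the documentation.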

The documentation for open also mentions no "t" flag, so that behavior seems a little strange.

Newline characters are not stripped from lines because not all lines returned by readline may end in newlines. If the file does not end with a newline, the last line does not carry one. (I obviously can't come up with a better explanation).

Newline characters do not differ based on the encoding (at least not among the ones which use ASCII for 0-127), only based on the platform. You can specify "U" in the mode when opening the file and Python will detect any form of newline, either Windows, Mac or Unix.
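
As an alternative to passing "U" to codecs.open, here is a sketch using io.open, which is available from Python 2.6 and is the built-in open in Python 3. It decodes first and handles newlines afterwards, and its newline argument controls the translation. The file name and contents below are invented for the example.

```python
import io
import os
import tempfile

# Hypothetical sample file: Latin-2 bytes with Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "sample_latin2.txt")
with open(path, "wb") as f:
    f.write(u"login: yabcok\r\nhas\u0142o: l1x1%Dm\r\n".encode("latin2"))

# newline=None (the default): '\r\n', '\r' and '\n' are all returned as '\n'.
with io.open(path, encoding="latin2", newline=None) as f:
    translated = f.readlines()

# newline="": lines are split on any ending, but the ending is left untouched.
with io.open(path, encoding="latin2", newline="") as f:
    untranslated = f.readlines()
```

With newline=None each line ends in '\n' regardless of the platform that wrote the file; with newline="" the original '\r\n' is preserved.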

Upvotes: 3
