user746461

file.read() returns different length in 2.7 and 3.3

I have a binary file that I read like this:

file = open(fname,"Ub")
len(file.read())

In Python 3.3, it returns 1279200 which is correct. In Python 2.7, it returns 1279106.

Why does this happen? What are the possible reasons?

In 2.7, how can I read all 1279200 bytes?

Upvotes: 3

Views: 193

Answers (3)

jfs

Reputation: 414235

In Python 3 'U' mode is ignored:

'U' mode is deprecated and will raise an exception in future versions of Python. It has no effect in Python 3. Use newline to control universal newlines mode.

'b' opens a file in binary mode, so len(file.read()) returns the number of bytes in the file (the same value as os.path.getsize(filename)).

In Python 2, 'U' and 'b' can be combined. 'b' sets the f_binary flag (no effect on reading as far as I can see) and is passed to the platform-specific fopen() (no effect on Unix). 'U' enables universal newlines mode (\r and \r\n are translated to \n). It may reduce the number of bytes read if the file contains \r\n sequences, since each pair collapses to a single \n.
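To make the effect on the length concrete, here is a minimal sketch that imitates the translation on a bytes value instead of passing 'U' to open() (newer Python 3 releases reject that flag). The helper name translate_universal_newlines and the sample bytes are made up for illustration; this is not the code Python 2 runs internally.

```python
def translate_universal_newlines(data):
    """Imitate 'U' mode on bytes: map \\r\\n and lone \\r to \\n."""
    # Replace \r\n first so a pair is not turned into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample with mixed line endings.
sample = b"1\n2\r\n3\n\r4\r5"

translated = translate_universal_newlines(sample)
crlf_pairs = sample.count(b"\r\n")

# Each \r\n pair shrinks by one byte; lone \r keeps the length unchanged.
print(len(sample), len(translated), crlf_pairs)
```

If the same thing is happening to the file in the question, the 94-byte gap (1279200 - 1279106) would correspond to 94 \r\n pairs in the data, though that is only an inference from the numbers given.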

Upvotes: 0

Michał Górny

Reputation: 19243

Long story short, U and b don't go together.

Python 3 follows PEP 3116 for its I/O implementation. If you look at the open() implementation, you'll notice that b uses the Buffered* interfaces, while universal newlines are implemented in TextIOWrapper. So, passing b simply disables the code that supports universal newlines.

In fact, this implementation of open() even fails if you try to enable binary mode and universal newlines at the same time. However, this code doesn't support the U mode argument at all, just the explicit newline parameter.
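The failure mentioned above can be reproduced with the explicit newline parameter. This is a small sketch, assuming io.open() rejects a newline argument in binary mode with a ValueError; the temporary file and the variable names are only for illustration.

```python
import io
import os
import tempfile

# Create a small throwaway file so the path exists.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\r\nb")
    path = tmp.name

try:
    # Binary mode plus an explicit newline setting is a conflicting request.
    io.open(path, "rb", newline="")
    combined_ok = True
except ValueError as exc:
    combined_ok = False
    error_message = str(exc)
finally:
    os.remove(path)

print(combined_ok)
```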

Now, I don't know why U in the actual implementation doesn't trigger the error. Maybe it's just an omission, maybe it's intended for backwards compatibility.

Now, Python 2 has two I/O implementations. If you used io.open(), you'd get the same behavior as Python 3. However, you are using the legacy open(), which goes through the C implementation of the file type (relevant code: open(), field setting, get_line()), and this code has no explicit separation between binary and text file support. Therefore, universal newline support is applied to binary files as well.

So, to sum up: you are trying to use two conflicting file modes. In Python 3, this should likely trigger an error but it doesn't for some reason. Instead, b is stronger than U and the latter doesn't work. In Python 2, the code had no clear split between binary and text files, and both b and U are respected, depending on the context.


A quick test:

$ printf '1\n2\r\n3\n\r4\r5' > f
$ ipython3.3
In [1]: open('f', 'Ub').read()
Out[1]: b'1\n2\r\n3\n\r4\r5'

$ ipython2.7
In [1]: import io

In [2]: io.open('f', 'Ub').read()
Out[2]: '1\n2\r\n3\n\r4\r5'

In [3]: open('f', 'Ub').read()
Out[3]: '1\n2\n3\n\n4\n5'

Upvotes: 1

ylsun

Reputation: 31

Quoting from the Python documentation:

supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

I suggest using 'rb'/'wb' mode instead; this works!
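As a rough sketch of that suggestion, the snippet below reads a file in plain 'rb' mode and checks the result against the size on disk. The helper read_all_bytes, the payload, and the temporary file are hypothetical stand-ins for the real fname; io.open() is used so the same code should behave alike on 2.7 and 3.x.

```python
import io
import os
import tempfile


def read_all_bytes(path):
    """Read the whole file in binary mode, with no newline translation."""
    with io.open(path, "rb") as f:
        return f.read()


# Hypothetical stand-in for the binary file from the question.
payload = b"line one\r\nline two\rline three\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    fname = tmp.name

data = read_all_bytes(fname)
size_on_disk = os.path.getsize(fname)
os.remove(fname)

# Without 'U', the byte count matches the file size.
print(len(data) == size_on_disk)
```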

Upvotes: 0
