paul23

Reputation: 9425

numpy.array.tofile() binary file looks "strange" in notepad++

I am just wondering how the function actually stores the data. Because to me, it looks completely strange. Say I have the following code:

import numpy as np
filename = "test.dat"
print(filename)
fileobj = open(filename, mode='wb')
off = np.array([1, 300], dtype=np.int32)
off.tofile(fileobj)
fileobj.close()

fileobj2 = open(filename, mode='rb')
off = np.fromfile(fileobj2, dtype = np.int32)
print(off)
fileobj2.close()

Now I expect 8 bytes inside the file, where each element is represented by 4 bytes (and I could live with any endianness). However when I open up the file in a hex editor (used notepad++ with hex editor plugin) I get the following bytes:

01 00 C4 AC 00

5 bytes, and I have no idea at all what it represents. The first byte looks like it is the number, but then what follows is something weird, certainly not "300".

Yet reloading shows the original array.

Is this something I don't understand in python, or is it a problem in notepad++? - I notice the hex looks different if I select a different "encoding" (huh?). Also: Windows does report it being 8 bytes long.

Upvotes: 0

Views: 21803

Answers (2)

abarnert

Reputation: 365577

You can easily confirm that the file really does contain 8 bytes, the same 8 bytes you'd expect (01 00 00 00 2C 01 00 00), by using anything other than Notepad++ to look at it. For instance, replace off = np.fromfile(fileobj2, dtype=np.int32) with off = fileobj2.read() and print the result, which gives b'\x01\x00\x00\x00,\x01\x00\x00' (see note 1 below).

And, judging from your comments, you tried that after I suggested it and saw exactly those bytes.
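
A minimal sketch of that check, printing the file the way a hex editor would. The temporary path and the explicit '<i4' (little-endian 32-bit) dtype are assumptions made here so the result does not depend on the machine; the original code used np.int32 and "test.dat".

```python
import os
import tempfile

import numpy as np

# '<i4' pins the layout to little-endian 32-bit integers (an assumption;
# np.int32 gives the same bytes on a little-endian machine).
arr = np.array([1, 300], dtype="<i4")

# Write to a throwaway location instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "test.dat")
arr.tofile(path)

# Read the raw bytes back without letting NumPy interpret them.
with open(path, "rb") as f:
    raw = f.read()

# Format each byte as two hex digits, like a hex editor column.
hex_dump = " ".join(f"{b:02X}" for b in raw)
print(len(raw), "bytes:", hex_dump)
```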

Which means this is either a bug in Notepad++, or a problem with the way you're using it; Python, NumPy, and your own code are perfectly fine.
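
One possible explanation for the five bytes, offered as a guess rather than something confirmed in this thread: if an editor decodes the file as UTF-16 (little-endian) text and then shows that text re-encoded as UTF-8, the eight bytes collapse to exactly 01 00 C4 AC 00. The sketch below only demonstrates that the arithmetic matches; it does not prove that this is what Notepad++ did.

```python
# The 8 bytes that are really in the file.
raw = b"\x01\x00\x00\x00,\x01\x00\x00"

# Hypothesis: the editor treats the bytes as UTF-16-LE text, giving
# four code points: U+0001, U+0000, U+012C, U+0000.
as_text = raw.decode("utf-16-le")

# Re-encoding that text as UTF-8 turns U+012C into the two bytes C4 AC.
reencoded = as_text.encode("utf-8")

print([f"U+{ord(ch):04X}" for ch in as_text])
print(" ".join(f"{b:02X}" for b in reencoded))
```

That would also fit the observation that the hex view changes with the selected "encoding".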


1) In case it isn't clear: '\x2c' and ',' are the same character, and bytes uses the printable ASCII representation for printable ASCII characters, as well as familiar escapes like '\n', when possible, only using the hex backslash escape for other values.
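
A short illustration of that note, using nothing beyond the built-in bytes type: the same eight bytes written with hex escapes throughout are displayed with ',' in place of '\x2c'.

```python
# Same data as above, written with a hex escape for every byte.
data = b"\x01\x00\x00\x00\x2c\x01\x00\x00"

# repr() shows printable ASCII bytes as characters, so 0x2c appears as ','.
shown = repr(data)
print(shown)

# The two spellings are the same single byte.
print(b"\x2c" == b",", data[4] == 0x2C == ord(","))

# Familiar escapes are used too: byte 10 is shown as '\n'.
print(repr(bytes([10])))
```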

Upvotes: 2

hpaulj

Reputation: 231325

What are you expecting 300 to look like?

Write the array, and read it back as binary (in ipython):

In [478]: np.array([1,300],np.int32).tofile('test')

In [479]: with open('test','rb') as f: print(f.read())
b'\x01\x00\x00\x00,\x01\x00\x00'

There are 8 bytes; the ',' is just the printable form of the byte 0x2C.

Actually, I don't have to go through a file to get this:

In [505]: np.array([1,300],np.int32).tobytes()
Out[505]: b'\x01\x00\x00\x00,\x01\x00\x00'

Do the same with:

[255]    
b'\xff\x00\x00\x00'

[256]
b'\x00\x01\x00\x00'

[300]
b',\x01\x00\x00'

[1,255]
b'\x01\x00\x00\x00\xff\x00\x00\x00'

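With powers of 2 (and values 1 less than them) the pattern in the bytes is easy to see: the least significant byte comes first (little-endian). So 300 = 44 + 1*256 is stored as 2C 01 00 00, and 0x2C is the ',' character.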
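
The same pattern can be reproduced without NumPy. This sketch uses Python's built-in int.to_bytes with an explicit little-endian byte order, which is an assumption matching the output shown above rather than something tofile guarantees on every platform.

```python
# Encode each value as a 4-byte little-endian integer.
encoded_values = {}
for value in (255, 256, 300):
    encoded_values[value] = value.to_bytes(4, byteorder="little")
    print(value, encoded_values[value])

# 300 splits into a low byte and a high byte: 300 = 44 + 1 * 256.
low, high = 300 % 256, 300 // 256
print(low, high, hex(low))

# Decoding the bytes recovers the original number.
print(int.from_bytes(b",\x01\x00\x00", byteorder="little"))
```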


frombuffer converts a byte string back to an array:

In [513]: np.frombuffer(np.array([1,300],np.int32).tobytes(),np.int32)
Out[513]: array([  1, 300])

In [514]: np.frombuffer(np.array([1,300],np.int32).data,np.int32)
Out[514]: array([  1, 300])

Judging from this last expression, tofile just writes the array's buffer to the file as raw bytes, with no header or other metadata.
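
That conclusion can be checked directly. The sketch below, which assumes a simple contiguous array and a temporary file path chosen for illustration, compares what tofile puts on disk with the array's in-memory bytes.

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 300], dtype=np.int32)

# Temporary path chosen for illustration only.
path = os.path.join(tempfile.mkdtemp(), "test.dat")
arr.tofile(path)

with open(path, "rb") as f:
    on_disk = f.read()

# The file holds the same bytes as the array's buffer, and nothing else.
same_bytes = on_disk == arr.tobytes()
file_size = os.path.getsize(path)
print(same_bytes, file_size, arr.nbytes)

# Because no dtype or shape is stored, the reader has to supply the dtype.
restored = np.fromfile(path, dtype=np.int32)
print(restored)
```

Since the file carries no dtype or shape, reading it back with a different dtype would silently produce different numbers.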

Upvotes: 1
