Dd Pp

Reputation: 5977

Read binary data in R instead of unpack in Python

I have been learning Python for 8 months and am a newbie to R. I have a binary file that I can read in Python, converting the binary data into a list (in Python, I use a list as the array).
The data file (named test) is at:
https://www.box.com/s/0g3qg2lqgmr7y7fk5aut
The structure is: every 4 bytes is an integer, so I read it with unpack in Python:

import struct

# Read the file 32 bytes (one row of 8 four-byte integers) at a time
datafile = open('test', 'rb')
data = datafile.read(32)
result = []
while data:
    result.append(list(struct.unpack('iiiiiiii', data)))
    data = datafile.read(32)
datafile.close()

How can I read the binary data in R?

Thanks to Paul Hiemstra's help, I finished the code in R:

datafile = "test"
totalsize = file.info(datafile)$size      # file size in bytes
lines = totalsize / 32                    # one row = 8 integers of 4 bytes each
data = readBin(datafile, integer(), n = totalsize / 4, size = 4, endian = "little")
result = data.frame(matrix(data, nrow = lines, ncol = 8, byrow = TRUE))
colnames(result) = c("date", "x1", "x2", "x3", "x4", "x5", "x6", "x7")

There is still a problem I want to solve. Here I read all the data at once; if the file is huge, there is not enough memory to hold it. How do I express "read only the 1001st to 2000th values"? With n = 1000, readBin reads values 1 to 1000, and with n = 2000 it reads values 1 to 2000, but how do I read only values 1001 to 2000? Is there a file pointer in R? After I read the first 1000 values, is the pointer at position 1000, so that running readBin("test", integer(), n = 1000, size = 4, endian = "little") again reads values 1001 to 2000?

Upvotes: 0

Views: 4317

Answers (1)

Paul Hiemstra

Reputation: 60944

Googling for R read binary file quickly leads to the readBin function, which in your case would look something like:

file2read = file("test", "rb")
number_of_integers_in_file = 128  # adjust to however many integers your file holds
spam = readBin(file2read, integer(), number_of_integers_in_file, size = 4)
close(file2read)

If you do not know the number of integers in the file, you can do a number of things, first create an example file:

# Create a binary file that we can read
l = as.integer(1:10)
file2write = file("/tmp/test", "wb")
writeBin(l, file2write)
close(file2write)

One strategy is to overestimate the number of integers to read; readBin will only return the numbers that really exist. Note that a vector of size n is preallocated, so take care not to make n too large.

file2read = file("/tmp/test", "rb")
l_read = readBin(file2read, integer(), n = 100)
close(file2read)
all.equal(l, l_read)
[1] TRUE

Alternatively, if you know the size, e.g. 4 bytes, of the numbers, you can calculate how many are present using the following function I wrote:

number_of_numbers = function(path, size = 4) {
  # If path is a file connection, extract the file name
  if(inherits(path, "file")) path = summary(path)[["description"]]
  return(file.info(path)[["size"]] / size)
}
number_of_numbers("/tmp/test")
[1] 10

In action:

file2read = file("/tmp/test", "rb")
l_read2 = readBin(file2read, integer(), n = number_of_numbers(file2read))
close(file2read)
all.equal(l, l_read2)
[1] TRUE

If the amount of data is too big to fit in memory, I would recommend reading in chunks. This can be done using consecutive calls to readBin, for example:

con = file("/tmp/test", "rb")
first_1000 = readBin(con, integer(), n = 1000)  # values 1 to 1000
next_1000 = readBin(con, integer(), n = 1000)   # the connection remembers its position: values 1001 to 2000

If you want to skip parts of the datafile, say the first 1000 numbers, use the seek function. This is much faster than reading 1000 numbers, discarding those, and reading the second 1000 numbers. For example:

# Skip the first thousand 4-byte integers
seek(con, where = 4 * 1000)
next_1000 = readBin(con, integer(), n = 1000)
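
Combining chunked reading with an open connection, a minimal sketch for processing your whole file without ever holding it all in memory could look like this. The chunk size of 1000 rows is an arbitrary choice, and I assume the file length is an exact multiple of 32 bytes, as in your question:

rows_per_chunk = 1000
con = file("test", "rb")
repeat {
  # Read up to 1000 rows of 8 four-byte integers; readBin returns
  # fewer values (or none) once the end of the file is reached
  chunk = readBin(con, integer(), n = rows_per_chunk * 8,
                  size = 4, endian = "little")
  if(length(chunk) == 0) break
  partial = data.frame(matrix(chunk, ncol = 8, byrow = TRUE))
  # ... process `partial` here before reading the next chunk ...
}
close(con)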

Upvotes: 6
