Reputation: 5977
I have learned Python for 8 months and am a newbie to R. There is a binary file; I can read it
and convert the binary data into a list (in Python, an array is a list).
The data file (named test) is at:
https://www.box.com/s/0g3qg2lqgmr7y7fk5aut
The structure is:
every 4 bytes is an integer, so I read it with unpack in Python:
import struct

datafile = open('test', 'rb')
data = datafile.read(32)
result = []
while data:
    result.append(list(struct.unpack('iiiiiiii', data)))
    data = datafile.read(32)
How can I read the binary data in R? My attempt so far:
datafile = "test"
totalsize = file.info(datafile)$size
lines = totalsize / 32
data = readBin("test", integer(), n = totalsize, size = 4, endian = "little")
result = data.frame(matrix(data, nrow = lines, ncol = 8, byrow = TRUE))
colnames(result) = c("date", "x1", "x2", "x3", "x4", "x5", "x6", "x7")
There is still a problem I want to solve. Here I read all the data at once with n = totalsize; if the file is huge, memory will not be enough to hold it. How do I express "read only the 1001st to the 2000th value"? With n = 1000 readBin reads the 1st to the 1000th, and with n = 2000 it reads the 1st to the 2000th, but how do I read just the 1001st to the 2000th? Is there a file pointer in R? After reading the first 1000 values the file pointer should sit at the 1000th position, so can I then call readBin("test", integer(), n = 1000, size = 4, endian = "little") again to read the 1001st to the 2000th?
Upvotes: 0
Views: 4317
Reputation: 60944
Googling for R read binary file
yields the following link as its first result. The bottom line is to use the readBin
function, which in your case would look something like:
file2read = file("test", "rb")
number_of_integers_in_file = 128
spam = readBin(file2read, integer(), number_of_integers_in_file, size = 4)
close(file2read)
If you do not know the number of integers in the file, you can do a number of things. First, create an example file to work with:
# Create a binary file that we can read
l = as.integer(1:10)
file2write = file("/tmp/test", "wb")
writeBin(l, file2write)
close(file2write)
One strategy is to overestimate the number of integers to read; readBin will only return the numbers that really exist in the file. A vector of size n is preallocated, though, so take care not to make n too large.
file2read = file("/tmp/test", "rb")
l_read = readBin(file2read, integer(), n = 100)
close(file2read)
all.equal(l, l_read)
[1] TRUE
Alternatively, if you know the size of the numbers (e.g. 4 bytes), you can calculate how many are present using the following function I wrote:
number_of_numbers = function(path, size = 4) {
  # If path is a file connection, extract the file name
  if (inherits(path, "file")) path = summary(path)[["description"]]
  return(file.info(path)[["size"]] / size)
}
number_of_numbers("/tmp/test")
[1] 10
In action:
file2read = file("/tmp/test", "rb")
l_read2 = readBin(file2read, integer(), n = number_of_numbers(file2read))
close(file2read)
all.equal(l, l_read2)
[1] TRUE
If the amount of data is too big to fit in memory, I would recommend reading it in chunks. This can be done using consecutive calls to readBin on the same open connection; each call starts where the previous one stopped, which answers your file-pointer question. For example:
con = file("test", "rb")
first_1000 = readBin(con, integer(), n = 1000)
next_1000 = readBin(con, integer(), n = 1000)
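Putting the pieces together, the consecutive calls can be wrapped in a loop that processes one chunk at a time. A minimal sketch (the path /tmp/test_chunks and the chunk size of 10 integers are made up for illustration; for the 32-byte rows in the question you would pick a chunk size that is a multiple of 8):

```r
# Create a small demo file of 25 integers so the loop is runnable
writeBin(as.integer(1:25), "/tmp/test_chunks")

con = file("/tmp/test_chunks", "rb")
chunk_size = 10  # integers per readBin call; tune to available memory
total = integer(0)
repeat {
  chunk = readBin(con, integer(), n = chunk_size)
  if (length(chunk) == 0) break  # readBin returns an empty vector at end of file
  total = c(total, chunk)        # stand-in for real per-chunk processing
}
close(con)
length(total)  # [1] 25
```

Each call advances the connection automatically, so no explicit pointer bookkeeping is needed.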
If you want to skip parts of the datafile, say the first 1000 numbers, use the seek
function. This is much faster than reading 1000 numbers, discarding them, and then reading the second 1000 numbers. For example:
# Skip the first thousand 4 byte integers
seek(con, where = 4*1000)
next_1000 = readBin(con, integer(), n = 1000)
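A self-contained version of the same idea, scaled down to a small temporary file (the path /tmp/test_seek is made up; note that R's own documentation advises caution with seek on Windows connections):

```r
# Write 20 four-byte integers, then skip the first 10 with seek()
writeBin(as.integer(1:20), "/tmp/test_seek")

con = file("/tmp/test_seek", "rb")
seek(con, where = 4 * 10)  # jump over the first ten 4-byte integers
second_half = readBin(con, integer(), n = 10)
close(con)
all.equal(second_half, as.integer(11:20))  # [1] TRUE
```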
Upvotes: 6