heinheo

Reputation: 565

Read binary data into R efficiently

From a text file I'm reading in binary data structured like this:

0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101

The file has 800 lines. All lines are equally long (though the length varies between files, so it doesn't make sense to hard-code it). I want the input stored in a data frame in which every line becomes a row and every digit goes into its own column, for example:

col1 col2 col3 col4
0      1    0    1

Currently I am doing it like this

g <- as.matrix(read.table(text = gsub("", " ", readLines("input"))))

However, that takes too long, as there are roughly 70,000 0/1 digits in each line.

Is there a quicker way to do this?

Upvotes: 4

Views: 651

Answers (3)

Martin Morgan

Reputation: 46866

From a subsequent question, from the structure of the data, and from the original solution, it seems that you'd like a matrix (since all columns are of the same type) rather than the data.frame indicated in the body of the question (which causes problems downstream!). The data doesn't seem too big, so read it in and split each line into individual characters

lns = strsplit(readLines("somefile.txt"), "")

Then unlist, match strings to integer, and reshape as matrix

v = match(unlist(lns), c("0", "1")) - 1L
m = matrix(v, nrow=length(lns), byrow=TRUE)
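
A quick sanity check, assuming the three sample lines from the question are saved in somefile.txt:

dim(m)       # 3 28
m[1, 1:4]    # 0 1 0 1, matching the desired col1..col4 layout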

Or as a function

input2matrix <- function(fname) {
    lns = strsplit(readLines(fname), "")       # one character vector per line
    v = match(unlist(lns), c("0", "1")) - 1L   # map "0"/"1" to integer 0/1
    matrix(v, nrow=length(lns), byrow=TRUE)    # one row per input line
}
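
Usage, with the question's (hypothetical) file name:

m <- input2matrix("somefile.txt")
dim(m)  # 800 70000 for the file described in the question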

This takes about 5s for the 800 x 70000 example. From comparison with the other responses, it is also faster than the other solutions (I couldn't get iotools to install easily; it complained about a missing C-level symbol, Rspace) and it does not make assumptions about the OS or the availability of OS tools (or knowledge of those tools in addition to R!).

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1

Reputation: 193537

I would recommend exploring read_fwf from the "readr" package. You can do something like this:

library(readr)
len <- nchar(readLines("yourfile.txt", n = 1))
read_fwf("yourfile.txt", fwf_widths(rep(1, len)))

Alternatively, you can try the "iotools" package, which might be faster:

library(iotools)
len <- nchar(readLines("yourfile.txt", n = 1))
input.file("yourfile.txt", formatter = dstrfw, 
            col_types = rep("integer", len), widths = rep(1, len))

Here's a small POC:

a <- tempfile()

writeLines("0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101", a)

len <- nchar(readLines(a, n = 1))

library(readr)
read_fwf(a, fwf_widths(rep(1, len)))
# Source: local data frame [3 x 28]
# 
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28
# 1  0  1  0  1  0  1  0  1  0   0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1   0
# 2  1  0  1  0  1  0  1  0  0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1   1   1
# 3  1  1  1  1  1  0  1  0  1   0   1   0   1   0   1   0   1   0   0   1   0   1   0   1   0   1   0   1

The dimensions of your data do seem to make read_fwf choke. I did a small test to compare the "iotools" approach with awk + fread.

Here is the sample data:

## Creates a file named "somefile.txt"
set.seed(1)
A <- replicate(10, sample(0:1, 70000, TRUE), simplify = FALSE)
A <- sapply(A, paste, collapse = "")
writeLines(rep(A, 800/length(A)), "somefile.txt")

Here are the functions and results. I've written the functions such that you should be able to try them on your actual data to see which works out best for you.

Obviously, it seems like readr is out of the picture at this stage :-)

Freadr <- function(infile = "somefile.txt") {
  len <- nchar(readLines(infile, n = 1))
  read_fwf(infile, fwf_widths(rep(1, len)))
}
system.time(temp1 <- Freadr())
# |===============================================================| 100%   53 MB
#    user  system elapsed 
# 466.740   0.384 466.506 

Fiotools <- function(infile = "somefile.txt") {
  len <- nchar(readLines(infile, n = 1))
  input.file(infile, formatter = dstrfw, 
             col_types = rep("integer", len), widths = rep(1, len))
}
system.time(temp2 <- Fiotools())
#    user  system elapsed 
#   7.248   0.016   7.273 

Fawk <- function(infile = "somefile.txt") {
  cmd <- sprintf("awk '{gsub(/./,\"&,\", $1);print $1}' %s", infile)
  fread(cmd)
}
system.time(temp3 <- Fawk())
#    user  system elapsed 
#  12.948   0.156  13.109 

For that matter, using base R is not too bad either:

fun4 <- function(infile = "somefile.txt") {
  do.call(rbind, lapply(strsplit(readLines(infile), "", TRUE), as.numeric))
}
system.time(fun4())
#    user  system elapsed 
#   9.056   0.260   9.304 

The result there is a matrix, so you may need to add a couple of seconds for conversion to a data.frame or a data.table if that's really what you want.
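
For example, a minimal conversion sketch using the matrix returned by fun4:

m <- fun4()
df <- as.data.frame(m)   # base data.frame
library(data.table)
dt <- as.data.table(m)   # or a data.table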

Upvotes: 6

akrun

Reputation: 887241

You could pipe with awk

read.table(pipe("awk '{gsub(/./,\"& \", $1);print $1}' yourfile.txt"))
#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
#1  0  1  0  1  0  1  0  1  0   0   1   0   1   0   1   0   1   0   1   0   1
#2  1  0  1  0  1  0  1  0  0   1   0   1   0   1   0   1   0   1   0   1   0
#3  1  1  1  1  1  0  1  0  1   0   1   0   1   0   1   0   1   0   0   1   0
#  V22 V23 V24 V25 V26 V27 V28
#1   0   1   0   1   0   1   0
#2   1   0   1   0   1   1   1
#3   1   0   1   0   1   0   1

Or

read.table(pipe("awk '{gsub(\"\",\" \", $1);print $1}' yourfile.txt"))

fread can also be combined with awk

library(data.table)
fread("awk '{gsub(/./,\"&,\", $1);print $1}' yourfile.txt")

Using a dataset similar to the OP's,

library(stringi)
write.table(stri_rand_strings(800,70000, '[0-1]'), file='binary1.txt',
         row.names=FALSE, quote=FALSE, col.names=FALSE)

system.time(fread("awk '{gsub(/./,\"&,\", $1);print $1}' binary1.txt"))
#  user  system elapsed 
#16.444   0.108  16.542 

Upvotes: 7
