R - Read binary matrix with no separator

Question

I am trying to read a large (~100mb) binary matrix in R. This is how the plaintext looks like:

10001010
10010100
00101101

Expected output:

  V1 V2 V3 V4 V5 V6 V7 V8
r1  1  0  0  0  1  0  1  0
r2  1  0  0  1  0  1  0  0
r3  0  0  1  0  1  1  0  1

I am currently reading each line and separating the bits. Is there any more efficient way to do this?

akrun · Accepted Answer

A base R option (which could be slow) would be to scan the .txt file, split the elements by the delimiter "", convert to numeric/integer and rbind the list elements to create a matrix.

 m1 <- do.call(rbind,lapply(strsplit(scan("inpfile.txt", 
                 what=""), ""), as.numeric))
 m1
 #      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 #[1,]    1    0    0    0    1    0    1    0
 #[2,]    1    0    0    1    0    1    0    0
 #[3,]    0    0    1    0    1    1    0    1

A slightly more faster version is reading the file with fread, then use tstrsplit

library(data.table)
fread("inpfile.txt", colClasses="character")[, tstrsplit(V1, "")]
#    V1 V2 V3 V4 V5 V6 V7 V8
#1:  1  0  0  0  1  0  1  0
#2:  1  0  0  1  0  1  0  0
#3:  0  0  1  0  1  1  0  1

I would also change the delimiter by creating space between each character using awk (if the OP is using linux) and then read with fread (I can't test it as I am on a windows system.)

A faster option may also include using library(iotools)

n <- nchar(scan(file, what="",n=1))
library(iotools)
input.file("inpfile.txt", formatter=dstrfw, 
           col_types=rep("integer",n), widths=rep(1,n))
#  V1 V2 V3 V4 V5 V6 V7 V8
#1  1  0  0  0  1  0  1  0
#2  1  0  0  1  0  1  0  0
#3  0  0  1  0  1  1  0  1

Benchmarks

Using a slightly bigger dataset, the timings between readr and iotools are below.

n <-100000
cat(gsub("([[:alnum:]]{8})", "\1
", paste(sample(0:1, 
                n*8, TRUE), collapse="")), 
              file="dat2.txt")
library(readr)
tic <- Sys.time()
read_fwf("dat2.txt", fwf_widths(rep(1, 8)))
difftime(Sys.time(), tic)
#Time difference of 1.142145 secs

tic <- Sys.time()
input.file("dat2.txt", formatter=dstrfw, 
  col_types=rep("integer",8), widths=rep(1,8))
difftime(Sys.time(), tic)
#Time difference of 0.7440939 secs

library(LaF)
tic <- Sys.time()
laf <- laf_open_fwf("dat2.txt", column_widths = rep(1, 
    8),  column_types=rep("integer", 8))
## further processing (larger in memory)
dat <- laf[,]
difftime(Sys.time(), tic)
#Time difference of 0.1285172 secs

The most efficient so far is library(LaF) posted by @Tyler Rinker, followed by library(iotools)

R - Read binary matrix with no separator

Answers (2)

Benchmarks

Related Questions