Reputation: 565
From a text file I'm reading in binary data structured like this:
0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101
The file has 800 lines. Every line has the same length (but that varies between files, so it doesn't make sense to hard-code it). I want the input stored in a data frame in which every line is a row and every digit goes into its own column, for example:
col1 col2 col3 col4
0 1 0 1
Currently I am doing it like this:
# insert a space after every digit, then parse the result with read.table
g <- as.matrix(read.table(text = gsub("", " ", readLines("input"))))
However, that takes too long as there are roughly 70,000 0/1's in each line.
Is there a quicker way to do this?
Upvotes: 4
Views: 651
Reputation: 46866
From a subsequent question, from the structure of the data, and from the original solution, it seems that you'd like a matrix (since all columns are of the same type) rather than the data.frame indicated in the body of the question (which causes problems down-stream!). The data doesn't seem too big, so read it in and split each line into individual letters:
lns = strsplit(readLines("somefile.txt"), "")
Then unlist, match the strings to integers, and reshape as a matrix:
v = match(unlist(lns), c("0", "1")) - 1L
m = matrix(v, nrow=length(lns), byrow=TRUE)
Or as a function (note that the function uses its fname argument rather than a hard-coded file name):
input2matrix <- function(fname) {
    lns = strsplit(readLines(fname), "")
    v = match(unlist(lns), c("0", "1")) - 1L
    matrix(v, nrow=length(lns), byrow=TRUE)
}
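As a quick sanity check, here is a small usage sketch (assuming the input file is named somefile.txt, as in the sample-data generator further down):
m <- input2matrix("somefile.txt")
dim(m)                      # expect 800 rows and one column per digit
stopifnot(all(m %in% 0:1))  # every entry should be 0 or 1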
This takes about 5s for the 800 x 70,000 example. From comparison with the other responses, it is also faster than the other solutions (I couldn't get iotools to install easily; it complained about a missing C-level symbol, Rspace) and does not make assumptions about the OS or the availability of OS tools (and knowledge of those tools in addition to R!).
Upvotes: 2
Reputation: 193537
I would recommend exploring read_fwf from the "readr" package. You can do something like this:
library(readr)
len <- nchar(readLines("yourfile.txt", n = 1))
read_fwf("yourfile.txt", fwf_widths(rep(1, len)))
Alternatively, you can try the "iotools" package, which might be faster:
library(iotools)
len <- nchar(readLines("yourfile.txt", n = 1))
input.file("yourfile.txt", formatter = dstrfw,
col_types = rep("integer", len), widths = rep(1, len))
Here's a small POC:
a <- tempfile()
writeLines("0101010100101010101010101010
1010101001010101010101010111
1111101010101010100101010101", a)
len <- nchar(readLines(a, n = 1))
library(readr)
read_fwf(a, fwf_widths(rep(1, len)))
# Source: local data frame [3 x 28]
#
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28
# 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
# 2 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1
# 3 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1
The dimensions of your data do seem to make read_fwf choke. I did a small test to compare the "iotools" approach with awk + fread.
Here is the sample data:
## Creates a file named "somefile.txt"
set.seed(1)
A <- replicate(10, sample(0:1, 70000, TRUE), FALSE)
A <- sapply(A, paste, collapse = "")
writeLines(rep(A, 800/length(A)), "somefile.txt")
Here are the functions and results. I've written the functions such that you should be able to try them on your actual data to see which works out best for you. Obviously, it seems like readr is out of the picture at this stage :-)
Freadr <- function(infile = "somefile.txt") {
    len <- nchar(readLines(infile, n = 1))
    read_fwf(infile, fwf_widths(rep(1, len)))
}
system.time(temp1 <- Freadr())
# |===============================================================| 100% 53 MB
# user system elapsed
# 466.740 0.384 466.506
Fiotools <- function(infile = "somefile.txt") {
    len <- nchar(readLines(infile, n = 1))
    input.file(infile, formatter = dstrfw,
               col_types = rep("integer", len), widths = rep(1, len))
}
system.time(temp2 <- Fiotools())
# user system elapsed
# 7.248 0.016 7.273
Fawk <- function(infile = "somefile.txt") {
    cmd <- sprintf("awk '{gsub(/./,\"&,\", $1);print $1}' %s", infile)
    fread(cmd)
}
system.time(temp3 <- Fawk())
# user system elapsed
# 12.948 0.156 13.109
For that matter, using base R is not too bad either:
fun4 <- function(infile = "somefile.txt") {
    do.call(rbind, lapply(strsplit(readLines(infile), "", TRUE), as.numeric))
}
system.time(fun4())
# user system elapsed
# 9.056 0.260 9.304
The result there is a matrix, so you may need to add a couple of seconds for conversion to a data.frame or a data.table if that's really what you want.
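If you do need one of those, the conversion itself is a one-liner; a sketch, reusing fun4() from above:
library(data.table)
m <- fun4()
DT <- as.data.table(m)   # or as.data.frame(m) for a plain data.frame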
Upvotes: 6
Reputation: 887241
You could pipe with awk:
read.table(pipe("awk '{gsub(/./,\"& \", $1);print $1}' yourfile.txt"))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
#1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1
#2 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0
#3 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0
# V22 V23 V24 V25 V26 V27 V28
#1 0 1 0 1 0 1 0
#2 1 0 1 0 1 1 1
#3 1 0 1 0 1 0 1
Or
read.table(pipe("awk '{gsub(\"\",\" \", $1);print $1}' yourfile.txt"))
fread can also be combined with awk:
library(data.table)
fread("awk '{gsub(/./,\"&,\", $1);print $1}' yourfile.txt")
Using a dataset similar to the OP's:
library(stringi)
write.table(stri_rand_strings(800, 70000, '[0-1]'), file='binary1.txt',
            row.names=FALSE, quote=FALSE, col.names=FALSE)
system.time(fread("awk '{gsub(/./,\"&,\", $1);print $1}' binary1.txt"))
# user system elapsed
#16.444 0.108 16.542
Upvotes: 7