Reputation: 85
I have a text file like
0010
1101
1110
and I would like to fread() it into a data frame like
0 0 1 0
1 1 0 1
1 1 1 0
Although fread() + strsplit() can do this, applying it to more than 1M rows takes too long. Is there a way to tell fread() to split each row directly into single characters/integers? Thanks for your help.
Upvotes: 1
Views: 1026
Reputation: 473
You could use tstrsplit from data.table instead, which gives a nice speed boost.
library(data.table)
raw <- sample(1000:9999, size = 100000, replace = TRUE)
writeLines(as.character(raw), con = "tst.txt")
# Tom Kelly's suggestion
system.time(tmp1 <- t(as.data.frame(strsplit(readLines("tst.txt"), ""))))
# user system elapsed
# 19.280 0.522 22.109
# with tstrsplit
system.time({
tmp2 <- fread(file = "tst.txt", colClasses = "character")
tmp2[, c("var1", "var2", "var3", "var4") := tstrsplit(V1, split = "")]})
# user system elapsed
# 0.089 0.002 0.099
# read.fwf
system.time(tmp3 <- read.fwf('tst.txt', rep(1, 4)))
# user system elapsed
# 1.308 2.301 3.666
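As a rough sketch of how the tstrsplit approach might be applied to the data in the question: the file path and the column names d1..d4 are made up for illustration, and the code assumes every line has exactly four characters. Reading with colClasses = "character" keeps leading zeros such as "0010", and type.convert = TRUE asks tstrsplit to turn the split pieces into integers.

```r
library(data.table)

# Hypothetical sample file with the layout from the question
path <- tempfile(fileext = ".txt")
writeLines(c("0010", "1101", "1110"), path)

# Read each line as one character column so leading zeros are kept
dt <- fread(file = path, colClasses = "character", header = FALSE)

# Split every string into single characters and convert them to integers
cols <- paste0("d", 1:4)  # illustrative column names
dt[, (cols) := tstrsplit(V1, "", type.convert = TRUE)]
dt[, V1 := NULL]          # drop the original string column
```

If the lines can differ in length, tstrsplit pads the shorter ones with NA by default, so the number of names in cols would need to match the longest line.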
Upvotes: 2
Reputation: 388982
You can read this as a fixed-width file to get every character as a separate column.
data <- read.fwf('temp.txt', rep(1, 4))
You can also look into readr::read_fwf, which is faster than read.fwf.
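A minimal sketch of the readr variant, assuming the readr package is installed and that every line is exactly four characters wide. The temporary file stands in for temp.txt, and the "iiii" column specification (four integer columns) is a choice made here, not part of the answer above.

```r
library(readr)

# Hypothetical sample file with the layout from the question
path <- tempfile(fileext = ".txt")
writeLines(c("0010", "1101", "1110"), path)

# One width-1 field per character; "iiii" requests four integer columns
digits <- read_fwf(path, fwf_widths(rep(1, 4)), col_types = "iiii")
```

The result is a tibble whose columns get readr's default names (X1 to X4) unless col_names is passed to fwf_widths.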
Upvotes: 1
Reputation: 1598
This is not supported by data.table::fread, as mentioned in the documentation.
sep: The separator between columns. Defaults to the character in the set ‘[,\t |;:]’ that separates the sample of rows into the most number of lines with the same number of fields. Use ‘NULL’ or ‘""’ to specify no separator; i.e. each line a single character column like ‘base::readLines’ does.
Calling data.table::fread with sep="" doesn't split the characters; it returns each line as a single column.
> data.table::fread("test.txt", data.table=FALSE, sep="")
V1
1 1001
2 1101
3 1011
For example, readLines will read the file as a character vector and strsplit will return a list.
> readLines("test.txt")
[1] "1001" "1101" "1011"
> strsplit(readLines("test.txt"), "")
[[1]]
[1] "1" "0" "0" "1"
[[2]]
[1] "1" "1" "0" "1"
[[3]]
[1] "1" "0" "1" "1"
A data.frame is a list in which each element is a column, so you need the transpose of this.
> t(as.data.frame(strsplit(readLines("test.txt"), "")))
[,1] [,2] [,3] [,4]
c..1....0....0....1.. "1" "0" "0" "1"
c..1....1....0....1.. "1" "1" "0" "1"
c..1....0....1....1.. "1" "0" "1" "1"
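The row names in that output come from as.data.frame naming the columns after the list elements. One possible way to avoid them, sketched here in base R, is to flatten the list and fill a matrix row by row. This assumes every line has the same number of characters, and the temporary file is only a stand-in for test.txt.

```r
# Hypothetical sample file matching the example above
path <- tempfile(fileext = ".txt")
writeLines(c("1001", "1101", "1011"), path)

chars <- strsplit(readLines(path), "")

# Flatten the list and fill the matrix by row, converting to integer
m <- matrix(as.integer(unlist(chars)), nrow = length(chars), byrow = TRUE)

# Columns are named V1..V4 by default
df <- as.data.frame(m)
```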
Upvotes: 1