weichen song

Reputation: 85

R fread: how to read a txt file with no separator

I have a txt file like

0010

1101

1110

and I would like to read it with fread() into a data frame like

0 0 1 0

1 1 0 1

1 1 1 0

Although fread() + strsplit() can do this, applying it to more than 1M rows takes too long. Is there a way to tell fread() to split each row directly into single characters/integers? Thanks for your help.

Upvotes: 1

Views: 1026

Answers (3)

Jakob Gepp

Reputation: 473

You could use tstrsplit from data.table instead, which gives a nice speed boost.

library(data.table)
raw <- sample(1000:9999, size = 100000, replace = TRUE)
writeLines(as.character(raw), con = "tst.txt")

# Tom Kelly's suggestion
system.time(tmp1 <- t(as.data.frame(strsplit(readLines("tst.txt"), ""))))

# user  system elapsed 
# 19.280   0.522  22.109 

# with tstrsplit
system.time({
  tmp2 <- fread(file = "tst.txt", colClasses = "character")
  tmp2[, c("var1", "var2", "var3", "var4") := tstrsplit(V1, split = "")]
})

# user  system elapsed 
# 0.089   0.002   0.099 

# read.fwf
system.time(tmp3 <- read.fwf('tst.txt', rep(1, 4)))

# user  system elapsed 
# 1.308   2.301   3.666 
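The question asks for integer columns, while the tstrsplit call above leaves them as character. A minimal sketch of one way to get integers, assuming the three sample rows from the question and a temporary file (the names path and dt are illustrative only): tstrsplit has a type.convert argument that converts each resulting column.

```r
library(data.table)

# Write the sample rows from the question to a temporary file (illustrative data)
path <- tempfile(fileext = ".txt")
writeLines(c("0010", "1101", "1110"), path)

# Read each line as one character column so leading zeros are preserved
dt <- fread(file = path, colClasses = "character", header = FALSE)

# Split every string into single characters and convert the pieces to integer
dt[, paste0("var", 1:4) := tstrsplit(V1, split = "", type.convert = TRUE)]

# Drop the original unsplit column
dt[, V1 := NULL]
print(dt)
```

Reading the column as character first matters here: if fread guessed an integer type, "0010" would lose its leading zeros before the split.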

Upvotes: 2

Ronak Shah

Reputation: 388982

You can read this as a fixed-width file to get every character as a separate column.

data <- read.fwf('temp.txt', rep(1, 4))

You can also look into readr::read_fwf, which is faster than read.fwf.
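As a rough sketch of the readr variant, assuming the readr package is installed and using a temporary file with the sample rows from the question (path is an illustrative name): the column widths are passed through fwf_widths(), and col_types = "iiii" requests four integer columns.

```r
library(readr)

# Write the sample rows from the question to a temporary file (illustrative data)
path <- tempfile(fileext = ".txt")
writeLines(c("0010", "1101", "1110"), path)

# Four columns, each one character wide, all parsed as integers
data <- read_fwf(path, fwf_widths(rep(1, 4)), col_types = "iiii")
print(data)
```

The width vector has to match the line length, so this only works when every row has the same number of characters.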

Upvotes: 1

This is not supported by data.table::fread, as the documentation for the sep argument indicates:

sep: The separator between columns. Defaults to the character in
     the set ‘[,\t |;:]’ that separates the sample of rows into
     the most number of lines with the same number of fields. Use
     ‘NULL’ or ‘""’ to specify no separator; i.e. each line a
     single character column like ‘base::readLines’ does.

Calling data.table::fread with sep = "" does not split the line into characters; it returns a single column:

> data.table::fread("test.txt", data.table=FALSE, sep="")
    V1
1 1001
2 1101
3 1011

Instead, readLines reads the file into a character vector, and strsplit with an empty split string returns a list with one character vector per line.

> readLines("test.txt")
[1] "1001" "1101" "1011"
> strsplit(readLines("test.txt"), "")
[[1]]
[1] "1" "0" "0" "1"

[[2]]
[1] "1" "1" "0" "1"

[[3]]
[1] "1" "0" "1" "1"

Because a data.frame is a list in which each element is a column, you need to transpose the result.

> t(as.data.frame(strsplit(readLines("test.txt"), "")))
                      [,1] [,2] [,3] [,4]
c..1....0....0....1.. "1"  "0"  "0"  "1" 
c..1....1....0....1.. "1"  "1"  "0"  "1" 
c..1....0....1....1.. "1"  "0"  "1"  "1" 
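The transposed result above is a character matrix with awkward row names. A minimal base R sketch of an alternative, assuming every line has the same width (the inline lines vector stands in for readLines("test.txt"), and the names chars and df are illustrative): bind the strsplit pieces row-wise, then convert the matrix to integer before making a data frame.

```r
# Stand-in for readLines("test.txt")
lines <- c("1001", "1101", "1011")

# Each list element becomes one row of a character matrix
chars <- do.call(rbind, strsplit(lines, ""))

# Convert in place to integer while keeping the matrix dimensions
storage.mode(chars) <- "integer"

# Columns are named V1..V4 by default
df <- as.data.frame(chars)
print(df)
```

This skips the intermediate as.data.frame call on the list, which is where the odd row names come from.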

Upvotes: 1
