dww

Reputation: 31452

Efficiently read data files where columns are arranged in rows

Suppose I have a csv-like datafile that looks like the following, where the "columns" of data are arranged as rows:

col1,1.1,1.2,3.3
col2,A,B,C
col3,TRUE,TRUE,FALSE
col4,1,2,3
col5,1,2,3
col6,1,2,3
col7,1,2,3
col8,1,2,3
col9,1,2,3
col10,1,2,3
col11,1,2,3
col12,1,2,3
col13,1,2,3
col14,1,2,3
col15,1,2,3

How do I efficiently and robustly read files like this into R? Ideally I'd like a scalable, fast solution akin to data.table::fread that automatically determines the type of each column.

For the above example (if it is in a file called test.csv), I can do the following:

library(data.table)
library(magrittr)  # for %>%

dt = strsplit(readLines('test.csv'), ',') %>%
  lapply(function(r) fread(paste0(r, collapse = '\n'))) %>%
  as.data.table()

str(dt)
# Classes ‘data.table’ and 'data.frame':    3 obs. of  15 variables:
# $ col1 : num  1.1 1.2 3.3
# $ col2 : chr  "A" "B" "C"
# $ col3 : logi  TRUE TRUE FALSE
# $ col4 : int  1 2 3
# $ col5 : int  1 2 3
# $ col6 : int  1 2 3
# $ col7 : int  1 2 3
# $ col8 : int  1 2 3
# $ col9 : int  1 2 3
# $ col10: int  1 2 3
# $ col11: int  1 2 3
# $ col12: int  1 2 3
# $ col13: int  1 2 3
# $ col14: int  1 2 3
# $ col15: int  1 2 3
# - attr(*, ".internal.selfref")=<externalptr> 
#

However, this has some shortcomings. Apart from losing the speed and efficiency of fread, and requiring the separator to be known in advance, it is also not very robust.

For example, the line

col4,"hello, world","hello, world","hello, world"

will break it, as strsplit does not know whether a comma is a separator or part of a quoted string.

Missing data are also problematic:

col5,1,2,

produces

Warning message: In data.table(list(col1 = c(1.1, 1.2, 3.3)), list(col2 = c("A", : Item 5 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)

Is there a better way to read data like this?

Upvotes: 1

Views: 93

Answers (1)

akrun

Reputation: 887621

An easier option would be to transpose after reading the dataset and then do the type.convert:

dat <- read.csv("test.csv", header = FALSE, stringsAsFactors = FALSE)
dat2 <- type.convert(setNames(as.data.frame(t(dat[-1]),
        stringsAsFactors = FALSE), dat$V1), as.is = TRUE)
row.names(dat2) <- NULL

str(dat2)
#'data.frame':  3 obs. of  15 variables:
# $ col1 : num  1.1 1.2 3.3
# $ col2 : chr  "A" "B" "C"
# $ col3 : logi  TRUE TRUE FALSE
# $ col4 : int  1 2 3
# $ col5 : int  1 2 3
# $ col6 : int  1 2 3
# $ col7 : int  1 2 3
# $ col8 : int  1 2 3
# $ col9 : int  1 2 3
# $ col10: int  1 2 3
# $ col11: int  1 2 3
# $ col12: int  1 2 3
# $ col13: int  1 2 3
# $ col14: int  1 2 3
# $ col15: int  1 2 3
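Because read.csv does proper CSV parsing, this approach also copes with the quoted-comma and missing-value rows from the question. A quick sketch (using a hypothetical file test2.csv built inline to illustrate):

```r
# hypothetical sample containing the problem rows from the question
writeLines(c("col1,1,2,3",
             'col2,"hello, world","x,y",z',
             "col3,1,2,"),
           "test2.csv")

# same recipe as above: read, drop the name column, transpose, convert types
dat  <- read.csv("test2.csv", header = FALSE, stringsAsFactors = FALSE)
dat2 <- type.convert(setNames(as.data.frame(t(dat[-1]),
              stringsAsFactors = FALSE), dat$V1), as.is = TRUE)
row.names(dat2) <- NULL

str(dat2)
```

The embedded commas survive because read.csv honours the quoting, rather than splitting blindly on every comma.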

Or we can read with fread and then do the same transpose:

library(data.table)
dt <- fread("test.csv", header = FALSE)
type.convert(setNames(as.data.frame(t(dt[, -1, with = FALSE]),
       stringsAsFactors = FALSE), dt[[1]]), as.is = TRUE)

Or, as @Frank suggested:

fread("test.csv")[, setnames(transpose(.SD[,-1]), .SD[[1]])][, 
        lapply(.SD, type.convert)]
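For what it's worth, newer versions of data.table (1.12.0+, if I recall correctly) let transpose promote a column to the new column names via the make.names argument, which shortens the recipe. A sketch, assuming the sample file from the question:

```r
library(data.table)

# hypothetical sample file mirroring the question's layout
writeLines(c("col1,1.1,1.2,3.3",
             "col2,A,B,C",
             "col3,TRUE,TRUE,FALSE"),
           "test.csv")

# read everything as character, then flip rows to columns;
# make.names = "V1" uses the first column as the new header
dt  <- fread("test.csv", header = FALSE)
out <- transpose(dt, make.names = "V1")

# all columns are still character at this point,
# so let type.convert() guess the types
out <- out[, lapply(.SD, type.convert, as.is = TRUE)]
str(out)
```

Since fread does the parsing, quoted commas and missing trailing fields are handled for free.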

Upvotes: 2
