Reputation: 31452
Suppose I have a csv-like datafile that looks like the following, where the "columns" of data are arranged as rows:
col1,1.1,1.2,3.3
col2,A,B,C
col3,TRUE,TRUE,FALSE
col4,1,2,3
col5,1,2,3
col6,1,2,3
col7,1,2,3
col8,1,2,3
col9,1,2,3
col10,1,2,3
col11,1,2,3
col12,1,2,3
col13,1,2,3
col14,1,2,3
col15,1,2,3
How do I efficiently and robustly read files like this into R. Ideally I'd like a scalable, fast solution akin to data.table::fread
which automatically determines type of data.
For the above example (if it is in a file called test.csv
), I can do the following:
library(data.table)
dt = strsplit(read_lines('test.csv'),',') %>%
lapply(function(r) fread(paste0(r, collapse ='\n'))) %>%
as.data.table()
str(dt)
# Classes ‘data.table’ and 'data.frame': 3 obs. of 15 variables:
# $ col1 : num 1.1 1.2 3.3
# $ col2 : chr "A" "B" "C"
# $ col3 : logi TRUE TRUE FALSE
# $ col4 : int 1 2 3
# $ col5 : int 1 2 3
# $ col6 : int 1 2 3
# $ col7 : int 1 2 3
# $ col8 : int 1 2 3
# $ col9 : int 1 2 3
# $ col10: int 1 2 3
# $ col11: int 1 2 3
# $ col12: int 1 2 3
# $ col13: int 1 2 3
# $ col14: int 1 2 3
# $ col15: int 1 2 3
# - attr(*, ".internal.selfref")=<externalptr>
#
However this has some shortcomings. Apart from losing the speed and efficiency of fread, and requiring the separator to be known in advance, it is also not very robust.
For example, the line
col4,"hello, world","hello, world","hello, world"
will break it, as strsplit does not know whether a comma is a separator or part of a string.
Missing data are also problematic:
col5,1,2,
produces
Warning message: In data.table(list(col1 = c(1.1, 1.2, 3.3)), list(col2 = c("A", : Item 5 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
Is there a better way to read data like this?
Upvotes: 1
Views: 93
Reputation: 887621
An easier option would be to transpose after reading the dataset and then do the type.convert
dat <- read.csv("test.csv", header = FALSE, stringsAsFactors = FALSE)
dat2 <- type.convert(setNames(as.data.frame(t(dat[-1]),
stringsAsFactors = FALSE), dat$V1), as.is = TRUE)
row.names(dat2) <- NULL
str(dat2)
#'data.frame': 3 obs. of 15 variables:
# $ col1 : num 1.1 1.2 3.3
# $ col2 : chr "A" "B" "C"
# $ col3 : logi TRUE TRUE FALSE
# $ col4 : int 1 2 3
# $ col5 : int 1 2 3
# $ col6 : int 1 2 3
# $ col7 : int 1 2 3
# $ col8 : int 1 2 3
# $ col9 : int 1 2 3
# $ col10: int 1 2 3
# $ col11: int 1 2 3
# $ col12: int 1 2 3
# $ col13: int 1 2 3
# $ col14: int 1 2 3
# $ col15: int 1 2 3
Or we read with fread
and then do the same transpose
library(data.table)
dt <- fread("test.csv", header = FALSE)
type.convert(setNames(as.data.frame(t(dt[, -1, with = FALSE]),
stringsAsFactors = FALSE), dt[[1]], as.is = TRUE)
Or as @Frank suggested
fread("test.csv")[, setnames(transpose(.SD[,-1]), .SD[[1]])][,
lapply(.SD, type.convert)]
Upvotes: 2