lpryor
lpryor

Reputation: 363

r - getting vectors from rows in data.table

I've got a big data.table that's been read in from a csv file (essentially the results of an SQL query). Each record has several sets of 20 fields which are the annual figures for a particular thing (eg, a particular type of payment). Here's a simplified version, with just 5 rows, and 3 fields of annual contributions

> dt <- data.table(id=1:5, dob = sample(1950:2000, 5), cont01=11:15, cont02=21:25, cont03=31:35)
> dt
   id  dob cont01 cont02 cont03
1:  1 1981     11     21     31
2:  2 1954     12     22     32
3:  3 1985     13     23     33
4:  4 1986     14     24     34
5:  5 1970     15     25     35

I want to get as a minimum a list of vectors, one vector for each record:

list (c(11, 21, 31), c(12, 22, 32), c(13, 23, 33), c(14, 24, 34), c(15, 25, 35))

Ideally though, I think I'd like the vectors to be in the data table, as a new column. And even more ideally, I need the vectors to be fixed length, with each element the payment at a specific age. So the vector for these 3 columns for the first row would be

> c(rep(0, 5), 11, 21, 31, rep(0, 38))

where the first age in the vector is 15, and the last is 60.

It seems from this question that it's possible to have data.table columns of lists. But I haven't been able to work out how to create the contents from other columns in the same row.

For example:

> dt[1, list(list(c(.SD))), .SDcols=c("cont2011", "cont2012", "cont2013")]
       V1
1: <list>
> dt[1, list(list(c(.SD))), .SDcols=c("cont2011", "cont2012", "cont2013")][,V1]
[[1]]
[[1]]$`cont2011`
[1] 11

[[1]]$cont2012
[1] 21

[[1]]$cont2013
[1] 31

doesn't really seem to be giving what I want, in that I don't see how I can do nice vectory things on the contents of the V1 column. (I'm going to need to do lots of Euclidean distance type things on the vectors I get).

Any ideas? Suggestions for alternative approaches? There are around 13 million rows in the data.table, and 5 sets of 20 (or so) columns that I want to be able to do this to.

Upvotes: 1

Views: 292

Answers (2)

IceCreamToucan
IceCreamToucan

Reputation: 28705

You can create a new column of lists with the code below.

dt[, newcol := .(.(c(cont01, cont02, cont03))), by = id]

#    id  dob cont01 cont02 cont03   newcol
# 1:  1 1993     11     21     31 11,21,31
# 2:  2 1960     12     22     32 12,22,32
# 3:  3 1977     13     23     33 13,23,33
# 4:  4 1955     14     24     34 14,24,34
# 5:  5 1959     15     25     35 15,25,35

Another option which gives the same result is

dt[, newcol := .(.(unlist(.SD))), by = id, .SDcols = names(dt)[3:5]]

The function . is just an alias for list which is available within [.data.table.

Upvotes: 2

akrun
akrun

Reputation: 887951

We subset the columns of the dataset, split by the row index of the data and unlist

subdt <- as.data.frame(dt[, 3:5, with = FALSE])
lapply(split(subdt, row(subdt)), unlist, use.names = FALSE)

Or after subsetting the columns, take the transpose to convert it to a matrix, and then split by the col index of the matrix

m1 <- t(dt[, 3:5]) 
unname(split(m1, col(m1)))
#[[1]]
#[1] 11 21 31

#[[2]]
#[1] 12 22 32

#[[3]]
#[1] 13 23 33

#[[4]]
#[1] 14 24 34

#[[5]]
#[1] 15 25 35

Upvotes: 1

Related Questions