user2786962
user2786962

Reputation: 469

How to pick a value in data.table?

My data is of the following form

structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA, 
NA, 1), Day_1 = c(8, 7, 8, 9, 6, 6), Day_2 = c(94, 94, 102, 97, 
102, 100), Day_3 = c(104, 162, 133, 142, 96, 122)), .Names = c("atp", 
"len", "Day_1", "Day_2", "Day_3"), row.names = c(NA, -6L), class = "data.frame")

I want to attain the following output

structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA, 
NA, 1), Day_1 = c(8, 7, 8, 9, 6, 6), Day_2 = c(94, 94, 102, 97, 
102, 100), Day_3 = c(104, 162, 133, 142, 96, 122), output = c(94, 
NA, 133, NA, NA, 6)), .Names = c("atp", "len", "Day_1", "Day_2", 
"Day_3", "output"), row.names = c(NA, -6L), class = "data.frame")

which is basically depending on column 2 values it picks the value from column 3, 4 or 5.

I have achieved it through the following code

result<-cbind(y, output=apply(y, 1, function(r) r[r["len"]+2]))

But this process is very time taking. Is there any way to speed up the process? How can I use data.tables for this?

Upvotes: 4

Views: 186

Answers (4)

tonytonov
tonytonov

Reputation: 25608

Turns out, my solution is misleading, sorry for that.

Anyway, here's some interesting benchmarking: http://pastebin.com/adwmFRXP

For N = 1e3:

          test replications elapsed relative user.self sys.self user.child sys.child
3       sven()          100    0.03    1.000      0.03     0.00         NA        NA
5 codoremifa()          100    0.07    2.333      0.07     0.00         NA        NA
4     shadow()          100    0.21    7.000      0.19     0.00         NA        NA
1    default()          100    0.59   19.667      0.60     0.00         NA        NA
2  tonytonov()          100    1.31   43.667      1.04     0.27         NA        NA

For N = 1e4:

          test replications elapsed relative user.self sys.self user.child sys.child
3       sven()           50    0.02      1.0      0.02     0.00         NA        NA
5 codoremifa()           50    0.03      1.5      0.03     0.00         NA        NA
4     shadow()           50    0.09      4.5      0.09     0.00         NA        NA
2  tonytonov()           50    0.57     28.5      0.45     0.12         NA        NA
1    default()           50    2.93    146.5      2.93     0.00         NA        NA

For N = 1e5:

          test replications elapsed relative user.self sys.self user.child sys.child
3       sven()           10    0.01        1      0.02     0.00         NA        NA
5 codoremifa()           10    0.02        2      0.02     0.00         NA        NA
4     shadow()           10    0.03        3      0.03     0.00         NA        NA
2  tonytonov()           10    0.12       12      0.11     0.02         NA        NA
1    default()           10    8.75      875      8.66     0.01         NA        NA

For N = 1e6:

          test replications elapsed relative user.self sys.self user.child sys.child
3       sven()           10    0.01        1      0.02     0.00         NA        NA
5 codoremifa()           10    0.01        1      0.02     0.00         NA        NA
4     shadow()           10    0.03        3      0.03     0.00         NA        NA
2  tonytonov()           10    0.13       13      0.11     0.01         NA        NA
1    default()           10   86.73     8673     85.89     0.56         NA        NA

Upvotes: 2

Sven Hohenstein
Sven Hohenstein

Reputation: 81683

One possible approach:

result <- cbind(y, 
                output = unlist(y[3:5])[nrow(y) * (y$len -1) + seq.int(nrow(y))])

Another one (this should be faster):

result <- cbind(y, output = y[3:5][cbind(seq.int(nrow(y)), y$len)])

Both approaches result in:

#   atp len Day_1 Day_2 Day_3 output
# 1   1   2     8    94   104     94
# 2   0  NA     7    94   162     NA
# 3   1   3     8   102   133    133
# 4   0  NA     9    97   142     NA
# 5   0  NA     6   102    96     NA
# 6   1   1     6   100   122      6

Upvotes: 3

shadow
shadow

Reputation: 22293

Here's an answer using data.table. This should be the fastest version for large data.frame's. However, it also explicitly uses the structure of the data you provided, so it may not be that easily generalizable.

dt <- data.table(y)
dt[, output:=ifelse(len==1, Day_1, 
                    ifelse(len==2, Day_2, Day_3))]

Upvotes: 0

TheComeOnMan
TheComeOnMan

Reputation: 12875

Where dt1 is your first dataset -

for ( i in unique(dt1[!is.na(dt1$len),'len']))
{
  dt1[dt1$len == i & !is.na(dt1$len),'Output'] <- dt1[dt1$len == i & !is.na(dt1$len),paste0('Day_',i)]
}

Upvotes: 1

Related Questions