Reputation: 469
My data is of the following form
structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA,
NA, 1), Day_1 = c(8, 7, 8, 9, 6, 6), Day_2 = c(94, 94, 102, 97,
102, 100), Day_3 = c(104, 162, 133, 142, 96, 122)), .Names = c("atp",
"len", "Day_1", "Day_2", "Day_3"), row.names = c(NA, -6L), class = "data.frame")
I want to attain the following output
structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA,
NA, 1), Day_1 = c(8, 7, 8, 9, 6, 6), Day_2 = c(94, 94, 102, 97,
102, 100), Day_3 = c(104, 162, 133, 142, 96, 122), output = c(94,
NA, 133, NA, NA, 6)), .Names = c("atp", "len", "Day_1", "Day_2",
"Day_3", "output"), row.names = c(NA, -6L), class = "data.frame")
which is basically depending on column 2 values it picks the value from column 3, 4 or 5.
I have achieved it through the following code
result<-cbind(y, output=apply(y, 1, function(r) r[r["len"]+2]))
But this process is very time taking. Is there any way to speed up the process? How can I use data.tables for this?
Upvotes: 4
Views: 186
Reputation: 25608
Turns out, my solution is misleading, sorry for that.
Anyway, here's some interesting benchmarking: http://pastebin.com/adwmFRXP
For N = 1e3:
test replications elapsed relative user.self sys.self user.child sys.child
3 sven() 100 0.03 1.000 0.03 0.00 NA NA
5 codoremifa() 100 0.07 2.333 0.07 0.00 NA NA
4 shadow() 100 0.21 7.000 0.19 0.00 NA NA
1 default() 100 0.59 19.667 0.60 0.00 NA NA
2 tonytonov() 100 1.31 43.667 1.04 0.27 NA NA
For N = 1e4:
test replications elapsed relative user.self sys.self user.child sys.child
3 sven() 50 0.02 1.0 0.02 0.00 NA NA
5 codoremifa() 50 0.03 1.5 0.03 0.00 NA NA
4 shadow() 50 0.09 4.5 0.09 0.00 NA NA
2 tonytonov() 50 0.57 28.5 0.45 0.12 NA NA
1 default() 50 2.93 146.5 2.93 0.00 NA NA
For N = 1e5:
test replications elapsed relative user.self sys.self user.child sys.child
3 sven() 10 0.01 1 0.02 0.00 NA NA
5 codoremifa() 10 0.02 2 0.02 0.00 NA NA
4 shadow() 10 0.03 3 0.03 0.00 NA NA
2 tonytonov() 10 0.12 12 0.11 0.02 NA NA
1 default() 10 8.75 875 8.66 0.01 NA NA
For N = 1e6:
test replications elapsed relative user.self sys.self user.child sys.child
3 sven() 10 0.01 1 0.02 0.00 NA NA
5 codoremifa() 10 0.01 1 0.02 0.00 NA NA
4 shadow() 10 0.03 3 0.03 0.00 NA NA
2 tonytonov() 10 0.13 13 0.11 0.01 NA NA
1 default() 10 86.73 8673 85.89 0.56 NA NA
Upvotes: 2
Reputation: 81683
One possible approach:
result <- cbind(y,
output = unlist(y[3:5])[nrow(y) * (y$len -1) + seq.int(nrow(y))])
Another one (this should be faster):
result <- cbind(y, output = y[3:5][cbind(seq.int(nrow(y)), y$len)])
Both approaches result in:
# atp len Day_1 Day_2 Day_3 output
# 1 1 2 8 94 104 94
# 2 0 NA 7 94 162 NA
# 3 1 3 8 102 133 133
# 4 0 NA 9 97 142 NA
# 5 0 NA 6 102 96 NA
# 6 1 1 6 100 122 6
Upvotes: 3
Reputation: 22293
Here's an answer using data.table
. This should be the fastest version for large data.frame
's. However, it also explicitly uses the structure of the data you provided, so it may not be that easily generalizable.
dt <- data.table(y)
dt[, output:=ifelse(len==1, Day_1,
ifelse(len==2, Day_2, Day_3))]
Upvotes: 0
Reputation: 12875
Where dt1 is your first dataset -
for ( i in unique(dt1[!is.na(dt1$len),'len']))
{
dt1[dt1$len == i & !is.na(dt1$len),'Output'] <- dt1[dt1$len == i & !is.na(dt1$len),paste0('Day_',i)]
}
Upvotes: 1