Reputation: 443
Let say I have this data.table:
df = data.table(date = c(20180101, 20180102, 20180103, 20180104, 20180105, 20180106, 20180107, 20180108, 20180109, 20180110, 20180111, 20180112, 20180113, 20180114, 20180115, 20180116, 20180117, 20180118), value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18))
I want to do some calculations (e.g. mean) that is using subset of data. For example: In 20180103, the average will be the sum of (yesterday) 20180102 and (today) 20180103 value ((2+3)/2 = 2.5). This is then rolling until the end of the period.
Result is like this:
date mean
20180102 1.5
20180103 2.5
20180104 3.5
20180105 4.5
....
Obviously I can write a for loop, subset the data for each iteration then calculate the mean, store the data and output the result. It is deemed too slow using for loop, and using foreach I don't know how to save the result...
The for loop is like:
datelist = df[, .(date)]
# initialize the object
data = NA
temp = 0
for (i in 2:nrow(datelist)) {
today = as.numeric(datelist[i])
yesterday = as.numeric(datelist[i-1])
temp = df[date >= yesterday & date <= today]
temp = temp[, .(mean(value))]
temp = cbind(datelist[i], mean = temp$V1)
if (is.na(data)[1]){
data=temp
} else {
data=rbind(data,temp)
}
}
You can see I first subset the data and call it temp then do the calculation (average, use it to do lm, whatever function then stack it into data object)
This is slow and inefficient as I have millions of data point
Is there anyway I can do this in data.table syntax:
result = df[, { data = .SD[date >= yesterday & date <= today]
mean = mean(data$value)
list(mean = mean)}, by=.(date)]
I don't know how to express yesterday and today?? so that yesterday will be, in the for loop case, i-1 and today is i?
What i understand when doing by=.(date) is that data.table will look at each date and calculate whatever function you give in. If I can get the value (i.e. i) of which date the data.table is looking at now, then the value (i-1) will be yesterday...
Thanks
Upvotes: 0
Views: 214
Reputation: 5281
What about something like this
(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5
And here creating the data frame
data.table::data.table(date = .subset2(df,1)[-1],
mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
# date mean
# 1 20180102 1.5
# 2 20180103 2.5
# 3 20180104 3.5
# 4 20180105 4.5
# 5 20180106 5.5
# ...
with the data you provided.
Here are some benchmarking figures:
# create a bigger data frame
dfLarge <- data.table::data.table(
date = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
value = 1:10958
)
microbenchmark::microbenchmark(sol = {
data.table::data.table(date = .subset2(dfLarge,1)[-1],
mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
# expr min lq mean median uq max neval
# sol 367.955 423.203 921.4908 530.781 788.969 22095.85 100
If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.
That being said, here is a link providing some great information about subsetting in R
.
Upvotes: 0
Reputation: 2301
Staying away from for loops you could use a purrr map function like this:
nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)
The output vector is:
1.0 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5
Which is I think what you want.
Upvotes: 0
Reputation: 5405
You can use the shift
operator in the data.table
j
clause:
df[order(date),
rollmean := (value + shift(value, n = 1, type = "lag"))/2][]
date value rollmean
1: 20180101 1 NA
2: 20180102 2 1.5
3: 20180103 3 2.5
4: 20180104 4 3.5
5: 20180105 5 4.5
6: 20180106 6 5.5
7: 20180107 7 6.5
8: 20180108 8 7.5
...
Upvotes: 3