Plyr/data.table subsetting

Question

I've got data that is arrayed in a way that has day/month/year in separate columns (which is useful for how I need to subset the data). Each column besides the D/M/Y are columns of values. In my real data, these are counties in the U.S.

I need to sum different ranges of dates that vary by different day, month and year combinations. Given that each column is a county, the ddply/data.table/apply functions seem to be the most appropriate to solve the problem. However, I cannot figure out how to subset my data.

Here is a trivial example:

set.seed(1234)

data <- data.frame(day=rep(seq(1,5),9), month= rep(c(rep(1,5),rep(2,5),rep(11,5)),3), 
                   year=c(rep(1999,15),rep(2000,15),rep(2001,15)), Abel = round(runif(n=45, 0, 5),2) ,
                   Bertha=round(runif(n=45, 0, 5),2),  Charlie= round(runif(n=45,0, 5),2))

dt <- data.table(data)
keycols <- c("day", "month", "year")
setkeyv(dt, keycols) #not sure if this is necessary

Originally, I was thinking of using the ddply() command with the summarize option but have seen other instances where people have recommended using the data.table framework. However, I cannot figure out a way to subset the data in such a way that would get it to match results as follows:

sum(data$Abel[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #4.41
sum(data$Bertha[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #5.51
sum(data$Charlie[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #4.19

While this example might seem easy, I have to calculate 10-day averages of climate data for hundreds of counties over decades, so getting the ddply/data.table framework to work would really help!

I have tried both data.table and ddply, but both with errors:

dt[i=list(year==1999, month==2, day >=1, day<3), sum]
#Error in `[.data.table`(dt, i = list(year == 1999, month == 2, day >=  : 
#  typeof x.day (integer) != typeof i.day (logical)

I realize that this means that I'm doing something wrong with setting up the data.table, but I haven't used it before and have looked through several help files and can't figure out the subset issues.

For ddply, I get a different error when I just try to use one singular column to try and get it to work:

ddply(data, .(year, month, day, Abel), summarize, sum.val = sum )
#Error in vector(type, length) : 
#  vector: cannot make a vector of mode 'builtin'.

Any guidance on solving this type of problem would be greatly appreciated!

Mike.Gahan · Accepted Answer

I think you are close with data.table, you just need to modify how you are using the i statement.

dt[year==1999 & month==2 & between(day,0,2), lapply(.SD,sum), .SDcols=4:6]

   Abel Bertha Charlie
1: 3.25   1.92    4.06

You might also be interested in using the ISOdate function to make it a bit easier to work with your data.

dt[ , Date := ISOdate(year, month, day)]

From reading some of the comments in the other answer, it seems like this might be more of a rolling calc question. This https://github.com/mgahan/boRingTrees R package might of some help of you need running sums or averages or whatever.

Plyr/data.table subsetting

Answers (2)

Related Questions