Tony
Tony

Reputation: 149

Plyr/data.table subsetting

I've got data that is arrayed in a way that has day/month/year in separate columns (which is useful for how I need to subset the data). Each column besides the D/M/Y are columns of values. In my real data, these are counties in the U.S.

I need to sum different ranges of dates that vary by different day, month and year combinations. Given that each column is a county, the ddply/data.table/apply functions seem to be the most appropriate to solve the problem. However, I cannot figure out how to subset my data.

Here is a trivial example:

set.seed(1234)

data <- data.frame(day=rep(seq(1,5),9), month= rep(c(rep(1,5),rep(2,5),rep(11,5)),3), 
                   year=c(rep(1999,15),rep(2000,15),rep(2001,15)), Abel = round(runif(n=45, 0, 5),2) ,
                   Bertha=round(runif(n=45, 0, 5),2),  Charlie= round(runif(n=45,0, 5),2))

dt <- data.table(data)
keycols <- c("day", "month", "year")
setkeyv(dt, keycols) #not sure if this is necessary

Originally, I was thinking of using the ddply() command with the summarize option but have seen other instances where people have recommended using the data.table framework. However, I cannot figure out a way to subset the data in such a way that would get it to match results as follows:

sum(data$Abel[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #4.41
sum(data$Bertha[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #5.51
sum(data$Charlie[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #4.19

While this example might seem easy, I have to calculate 10-day averages of climate data for hundreds of counties over decades, so getting the ddply/data.table framework to work would really help!

I have tried both data.table and ddply, but both with errors:

dt[i=list(year==1999, month==2, day >=1, day<3), sum]
#Error in `[.data.table`(dt, i = list(year == 1999, month == 2, day >=  : 
#  typeof x.day (integer) != typeof i.day (logical)

I realize that this means that I'm doing something wrong with setting up the data.table, but I haven't used it before and have looked through several help files and can't figure out the subset issues.

For ddply, I get a different error when I just try to use one singular column to try and get it to work:

ddply(data, .(year, month, day, Abel), summarize, sum.val = sum )
#Error in vector(type, length) : 
#  vector: cannot make a vector of mode 'builtin'.

Any guidance on solving this type of problem would be greatly appreciated!

Upvotes: 2

Views: 492

Answers (2)

Paulo E. Cardoso
Paulo E. Cardoso

Reputation: 5856

Try this to start

library(reshape2)
datam <- melt(data, id.vars = c('day', 'month', 'year'),
              variable.name = "name")
library(dplyr)
datam %>%
  filter(year==1999, month==2, day %in% 1:3) %>%
  group_by(year, month, name) %>%
  summarise(summed = sum(value))


Source: local data frame [3 x 4]
Groups: year, month

  year month    name summed
1 1999     2    Abel   4.41
2 1999     2  Bertha   5.51
3 1999     2 Charlie   4.19

Upvotes: 1

Mike.Gahan
Mike.Gahan

Reputation: 4615

I think you are close with data.table, you just need to modify how you are using the i statement.

dt[year==1999 & month==2 & between(day,0,2), lapply(.SD,sum), .SDcols=4:6]

   Abel Bertha Charlie
1: 3.25   1.92    4.06

You might also be interested in using the ISOdate function to make it a bit easier to work with your data.

dt[ , Date := ISOdate(year, month, day)]

From reading some of the comments in the other answer, it seems like this might be more of a rolling calc question. This https://github.com/mgahan/boRingTrees R package might of some help of you need running sums or averages or whatever.

Upvotes: 3

Related Questions