Reputation: 149
I've got data that is arrayed in a way that has day/month/year in separate columns (which is useful for how I need to subset the data). Each column besides the D/M/Y are columns of values. In my real data, these are counties in the U.S.
I need to sum different ranges of dates that vary by different day, month and year combinations. Given that each column is a county, the ddply/data.table/apply functions seem to be the most appropriate to solve the problem. However, I cannot figure out how to subset my data.
Here is a trivial example:
set.seed(1234)
data <- data.frame(day=rep(seq(1,5),9), month= rep(c(rep(1,5),rep(2,5),rep(11,5)),3),
year=c(rep(1999,15),rep(2000,15),rep(2001,15)), Abel = round(runif(n=45, 0, 5),2) ,
Bertha=round(runif(n=45, 0, 5),2), Charlie= round(runif(n=45,0, 5),2))
dt <- data.table(data)
keycols <- c("day", "month", "year")
setkeyv(dt, keycols) #not sure if this is necessary
Originally, I was thinking of using the ddply() command with the summarize option but have seen other instances where people have recommended using the data.table framework. However, I cannot figure out a way to subset the data in such a way that would get it to match results as follows:
sum(data$Abel[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #4.41
sum(data$Bertha[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #5.51
sum(data$Charlie[which(data$year==1999 & data$month==2 & data$day >=1 & data$day <4)]) #4.19
While this example might seem easy, I have to calculate 10-day averages of climate data for hundreds of counties over decades, so getting the ddply/data.table framework to work would really help!
I have tried both data.table and ddply, but both with errors:
dt[i=list(year==1999, month==2, day >=1, day<3), sum]
#Error in `[.data.table`(dt, i = list(year == 1999, month == 2, day >= :
# typeof x.day (integer) != typeof i.day (logical)
I realize that this means that I'm doing something wrong with setting up the data.table, but I haven't used it before and have looked through several help files and can't figure out the subset issues.
For ddply, I get a different error when I just try to use one singular column to try and get it to work:
ddply(data, .(year, month, day, Abel), summarize, sum.val = sum )
#Error in vector(type, length) :
# vector: cannot make a vector of mode 'builtin'.
Any guidance on solving this type of problem would be greatly appreciated!
Upvotes: 2
Views: 492
Reputation: 5856
Try this to start
library(reshape2)
datam <- melt(data, id.vars = c('day', 'month', 'year'),
variable.name = "name")
library(dplyr)
datam %>%
filter(year==1999, month==2, day %in% 1:3) %>%
group_by(year, month, name) %>%
summarise(summed = sum(value))
Source: local data frame [3 x 4]
Groups: year, month
year month name summed
1 1999 2 Abel 4.41
2 1999 2 Bertha 5.51
3 1999 2 Charlie 4.19
Upvotes: 1
Reputation: 4615
I think you are close with data.table
, you just need to modify how you are using the i
statement.
dt[year==1999 & month==2 & between(day,0,2), lapply(.SD,sum), .SDcols=4:6]
Abel Bertha Charlie
1: 3.25 1.92 4.06
You might also be interested in using the ISOdate
function to make it a bit easier to work with your data.
dt[ , Date := ISOdate(year, month, day)]
From reading some of the comments in the other answer, it seems like this might be more of a rolling calc question. This https://github.com/mgahan/boRingTrees R package might of some help of you need running sums or averages or whatever.
Upvotes: 3