lg929
lg929

Reputation: 234

Find average values of a column in terms of date range of another column in R

I have two data frames that look like this:

> head(y,n=4)
Source: local data frame [6 x 3]

  Start Date   End Date   Length

1 2006-06-08 2006-06-10        3
2 2006-06-12 2006-06-14        3
3 2006-06-18 2006-06-21        4
4 2006-06-24 2006-06-25        2

and

> head(x,n=19)
          Date   Group.Size
413 2006-06-07            6
414 2006-06-08            3
415 2006-06-09            1
416 2006-06-10            3
417 2006-06-11            15
418 2006-06-12            12
419 2006-06-13            NA
420 2006-06-14            4
421 2006-06-15            8
422 2006-06-16            3
423 2006-06-17            1
424 2006-06-18            3
425 2006-06-19            10
426 2006-06-20            2
427 2006-06-21            7
428 2006-06-22            6
429 2006-06-23            2
430 2006-06-24            1
431 2006-06-25            0

I'm looking for a way to add a new column in data frame y that will show the average Group.Size of data frame x (rounded to nearest integer), depending on the given Start Date and End Dates provided in y.

For example, in the first row of y, I have 6/8/06 to 6/10/06. This is a length of 3 days, so I would want the new column to have the number 2, because the corresponding Group.Size values are 3, 1, and 3 for the respective days in data frame x (mean=2.33, rounded to nearest integer is 2).

If there is an NA in my dataframe x, I'd like to consider it a 0.

There are multiple steps involved in this task, and there is probably a straightforward approach... I am relatively new to R, and am having a hard time breaking it down. Please let me know if I should clarify my example.

Upvotes: 3

Views: 3151

Answers (5)

steveb
steveb

Reputation: 5532

Here is a different dplyr solution

library(dplyr)

na2zero <- function(x) ifelse(is.na(x),0,x) # Convert NA to zero
ydf %>%
    group_by(Start_Date, End_Date) %>%
    mutate(avg = round(mean(na2zero(xdf$Group.Size[ between(xdf$Date, Start_Date, End_Date) ])), 0)) %>%
    ungroup

##   Start_Date   End_Date Length   avg
##       (time)     (time)  (int) (dbl)
## 1 2006-06-08 2006-06-10      3     2
## 2 2006-06-12 2006-06-14      3     5
## 3 2006-06-18 2006-06-21      4     6
## 4 2006-06-24 2006-06-25      2     0

Upvotes: 1

Gaurav Bansal
Gaurav Bansal

Reputation: 5660

#Replace missing values in x with 0
x[is.na(x)] <- 0

#Create new 'Group' variable and loop through x to create groups 
x$Group <-1
j <- 1
for(i in 1:nrow(x)){
  if(x[i,"Date"]==y[j,"StartDate"]){
    x[i,"Group"] <- j+1
    if(j<nrow(y)){
      j <- j+1
    } else{
      j <- j 
    }
  }else if(i>1){
    x[i,"Group"] <- x[i-1,"Group"]
  }else {
    x[i,"Group"] <- 1
  }
}

#Use tapply function to get the rounded mean of each Group
tapply(x$Group.Size, x$Group, function(z) round(mean(z)))

Upvotes: 1

submartingale
submartingale

Reputation: 755

Assuming that x$Date, y$StartDate, and y$EndDate are of class Date (or, character), the following apply approach should be doing the trick:

 y$AvGroupSize<- apply(y, 1, function(z) {
                 round(mean(x$Group.Size[which(x$Date >= z[1] & x$Date <=z[2])], na.rm=T),0)
    }
)

Upvotes: 2

Pierre L
Pierre L

Reputation: 28441

There are many ways but here is one. We can first create a list of date positions with lapply (SN: Be sure that the dates are in chronological order). Then we map the function round(mean(Group.Size)) to each of the values:

lst <- lapply(y[1:2], function(.x) match(.x, x[,"Date"]))
y$avg <- mapply(function(i,j) round(mean(x$Group.Size[i:j], na.rm=TRUE)), lst[[1]],lst[[2]])
y
#    StartDate    EndDate Length avg
# 1 2006-06-08 2006-06-10      3   2
# 2 2006-06-12 2006-06-14      3   8
# 3 2006-06-18 2006-06-21      4   6
# 4 2006-06-24 2006-06-25      2   0

Upvotes: 0

Stibu
Stibu

Reputation: 15897

This is a solution that applies over the rows of the data frame y:

library(dplyr)
get_mean_size <- function(start, end, length) {
   s <- sum(filter(x, Date >= start, Date <= end)$Group.Size, na.rm = TRUE)
   round(s/length)
}
y$Mean.Size = Map(get_mean_size, y$Start_Date, y$End_Date, y$Length)
y
##   Start_Date   End_Date Length Mean.Size
## 1 2006-06-08 2006-06-10      3         2
## 2 2006-06-12 2006-06-14      3         5
## 3 2006-06-18 2006-06-21      4         6
## 4 2006-06-24 2006-06-25      2         0

It uses two functions from the dplyr package: filter() and mutate().

First I define the function get_mean_size that is supposed with the three values from a column in y: Start_Date, End_Date and length. It fist selects the relevant rows from x using filter and sums up the column Group.Size. Using na.rm = TRUE tells sum() to ignore NA values, which is the same as setting them to zero. Then the average is calculated by dividing by length and rounding. Note that round rounds half to even, thus 0.5 is rounded to 0, while 1.5 is rounded to 2.

This function is then applied to all rows of y using Map() and added as a new column to y.

A final note regarding the dates in x and y. This solution assumes that the dates are stored as Date object. You can check this using, e. g.,

is(x$Date, "Date")

If they do not have class Date, you can convert them using

x$Date <- as.Date(x$Date)

(and simliarly for y$Start_Date and y$End_Date).

Upvotes: 0

Related Questions