Reputation: 234
I have two data frames that look like this:
> head(y,n=4)
Source: local data frame [6 x 3]
Start Date End Date Length
1 2006-06-08 2006-06-10 3
2 2006-06-12 2006-06-14 3
3 2006-06-18 2006-06-21 4
4 2006-06-24 2006-06-25 2
and
> head(x,n=19)
Date Group.Size
413 2006-06-07 6
414 2006-06-08 3
415 2006-06-09 1
416 2006-06-10 3
417 2006-06-11 15
418 2006-06-12 12
419 2006-06-13 NA
420 2006-06-14 4
421 2006-06-15 8
422 2006-06-16 3
423 2006-06-17 1
424 2006-06-18 3
425 2006-06-19 10
426 2006-06-20 2
427 2006-06-21 7
428 2006-06-22 6
429 2006-06-23 2
430 2006-06-24 1
431 2006-06-25 0
I'm looking for a way to add a new column in data frame y that will show the average Group.Size of data frame x (rounded to nearest integer), depending on the given Start Date and End Dates provided in y.
For example, in the first row of y, I have 6/8/06 to 6/10/06. This is a length of 3 days, so I would want the new column to have the number 2, because the corresponding Group.Size values are 3, 1, and 3 for the respective days in data frame x (mean=2.33, rounded to nearest integer is 2).
If there is an NA in my dataframe x, I'd like to consider it a 0.
There are multiple steps involved in this task, and there is probably a straightforward approach... I am relatively new to R, and am having a hard time breaking it down. Please let me know if I should clarify my example.
Upvotes: 3
Views: 3151
Reputation: 5532
Here is a different dplyr
solution
library(dplyr)
na2zero <- function(x) ifelse(is.na(x),0,x) # Convert NA to zero
ydf %>%
group_by(Start_Date, End_Date) %>%
mutate(avg = round(mean(na2zero(xdf$Group.Size[ between(xdf$Date, Start_Date, End_Date) ])), 0)) %>%
ungroup
## Start_Date End_Date Length avg
## (time) (time) (int) (dbl)
## 1 2006-06-08 2006-06-10 3 2
## 2 2006-06-12 2006-06-14 3 5
## 3 2006-06-18 2006-06-21 4 6
## 4 2006-06-24 2006-06-25 2 0
Upvotes: 1
Reputation: 5660
#Replace missing values in x with 0
x[is.na(x)] <- 0
#Create new 'Group' variable and loop through x to create groups
x$Group <-1
j <- 1
for(i in 1:nrow(x)){
if(x[i,"Date"]==y[j,"StartDate"]){
x[i,"Group"] <- j+1
if(j<nrow(y)){
j <- j+1
} else{
j <- j
}
}else if(i>1){
x[i,"Group"] <- x[i-1,"Group"]
}else {
x[i,"Group"] <- 1
}
}
#Use tapply function to get the rounded mean of each Group
tapply(x$Group.Size, x$Group, function(z) round(mean(z)))
Upvotes: 1
Reputation: 755
Assuming that x$Date
, y$StartDate
, and y$EndDate
are of class Date
(or, character
), the following apply
approach should be doing the trick:
y$AvGroupSize<- apply(y, 1, function(z) {
round(mean(x$Group.Size[which(x$Date >= z[1] & x$Date <=z[2])], na.rm=T),0)
}
)
Upvotes: 2
Reputation: 28441
There are many ways but here is one. We can first create a list of date positions with lapply
(SN: Be sure that the dates are in chronological order). Then we map the function round(mean(Group.Size))
to each of the values:
lst <- lapply(y[1:2], function(.x) match(.x, x[,"Date"]))
y$avg <- mapply(function(i,j) round(mean(x$Group.Size[i:j], na.rm=TRUE)), lst[[1]],lst[[2]])
y
# StartDate EndDate Length avg
# 1 2006-06-08 2006-06-10 3 2
# 2 2006-06-12 2006-06-14 3 8
# 3 2006-06-18 2006-06-21 4 6
# 4 2006-06-24 2006-06-25 2 0
Upvotes: 0
Reputation: 15897
This is a solution that applies over the rows of the data frame y
:
library(dplyr)
get_mean_size <- function(start, end, length) {
s <- sum(filter(x, Date >= start, Date <= end)$Group.Size, na.rm = TRUE)
round(s/length)
}
y$Mean.Size = Map(get_mean_size, y$Start_Date, y$End_Date, y$Length)
y
## Start_Date End_Date Length Mean.Size
## 1 2006-06-08 2006-06-10 3 2
## 2 2006-06-12 2006-06-14 3 5
## 3 2006-06-18 2006-06-21 4 6
## 4 2006-06-24 2006-06-25 2 0
It uses two functions from the dplyr
package: filter()
and mutate()
.
First I define the function get_mean_size
that is supposed with the three values from a column in y
: Start_Date
, End_Date
and length
. It fist selects the relevant rows from x
using filter and sums up the column Group.Size
. Using na.rm = TRUE
tells sum()
to ignore NA
values, which is the same as setting them to zero. Then the average is calculated by dividing by length
and rounding. Note that round
rounds half to even, thus 0.5 is rounded to 0, while 1.5 is rounded to 2.
This function is then applied to all rows of y
using Map()
and added as a new column to y
.
A final note regarding the dates in x
and y
. This solution assumes that the dates are stored as Date
object. You can check this using, e. g.,
is(x$Date, "Date")
If they do not have class Date
, you can convert them using
x$Date <- as.Date(x$Date)
(and simliarly for y$Start_Date
and y$End_Date
).
Upvotes: 0