Reputation: 589
I have a data frame with time-series columns (one per year) plus a start year and an end year for each row.
df = data.frame(y2000=c(12,636),y2001=c(234, 76),y2002=c(3434, 46),y2003=c(36,35),y2004=c(6, 64), y2005=c(56,65), y2006=c(43,65), y2007=c( 6, 56),y2008=c( 64, 66),y2009=c(63, 5656),y2010 = c(65,54),startyear= c(2006, 2001), endyear= c(2009, 2005))
For each row I want to calculate the mean value within the start and end years, and also the means before the start year and after the end year. The desired output looks like:
y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010 startyear endyear before_mean within_mean after_mean
12 234 3434 36 6 56 43 6 64 63 65 2006 2009 629.6666667 44 65
636 76 46 35 64 65 65 56 66 5656 54 2001 2005 636 57.2 1179.4
I have tried different match and index techniques, but I can't wrap my mind around this one.
Upvotes: 3
Views: 672
Reputation: 10421
Another approach, different from akrun's, also using base R. We create an intermediate vector that follows the order of the column names but holds the years as numbers. It is used to select columns of the actual data frame:
col.years <- suppressWarnings(as.numeric(sub("^y", "", colnames(df))))[1:11]
# Initialise everything to NA (better when preparing to loop over df)
df$before_mean <- NA
df$within_mean <- NA
df$after_mean <- NA
for(i in seq_len(nrow(df))) {
  df$before_mean[i] <- mean(as.numeric(df[i, which(col.years < df$startyear[i])]))
  df$within_mean[i] <- mean(as.numeric(df[i, which((col.years >= df$startyear[i]) & (col.years <= df$endyear[i]))]))
  df$after_mean[i] <- mean(as.numeric(df[i, which(col.years > df$endyear[i])]))
}
Results
df[,14:16]
# before_mean within_mean after_mean
# 1 629.6667 44.0 65.0
# 2 636.0000 57.2 1179.4
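The same year comparisons can probably also be done without an explicit loop. The following is only a sketch, not part of the original answer: it builds one logical mask per period with outer() and averages the masked cells. The names year_cols, years, vals and masked_row_mean are made up for illustration, and the year columns are assumed to be named "y" followed by digits.

```r
df <- data.frame(y2000=c(12,636), y2001=c(234,76), y2002=c(3434,46),
                 y2003=c(36,35), y2004=c(6,64), y2005=c(56,65),
                 y2006=c(43,65), y2007=c(6,56), y2008=c(64,66),
                 y2009=c(63,5656), y2010=c(65,54),
                 startyear=c(2006,2001), endyear=c(2009,2005))

# Assumption: year columns are exactly those named "y" + digits
year_cols <- grep("^y[0-9]+$", names(df))
years <- as.numeric(sub("^y", "", names(df)[year_cols]))
vals <- as.matrix(df[year_cols])

# Hypothetical helper: row-wise mean over the cells where mask is TRUE
# (gives NaN for a row whose mask is empty)
masked_row_mean <- function(vals, mask) {
  unname(rowSums(vals * mask) / rowSums(mask))
}

# One row per data row, one column per year
before <- outer(df$startyear, years, ">")    # year < startyear
within <- outer(df$startyear, years, "<=") &
          outer(df$endyear, years, ">=")     # startyear <= year <= endyear
after  <- outer(df$endyear, years, "<")      # year > endyear

df$before_mean <- masked_row_mean(vals, before)
df$within_mean <- masked_row_mean(vals, within)
df$after_mean  <- masked_row_mean(vals, after)
```

A row whose period starts at the first year or ends at the last year gets NaN for the empty side rather than an error.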
Upvotes: 3
Reputation: 47546
Here is a solution:
#The original data:
df = data.frame(y2000=c(12,636),y2001=c(234, 76),y2002=c(3434, 46),y2003=c(36,35),y2004=c(6, 64), y2005=c(56,65), y2006=c(43,65), y2007=c( 6, 56),y2008=c( 64, 66),y2009=c(63, 5656),y2010 = c(65,54),startyear= c(2006, 2001), endyear= c(2009, 2005))
df$s = df$startyear - 1999
df$e = df$endyear - 1999
# Columns 14 and 15 are the helper indices 's' and 'e' added above
df$before_mean <- apply(df, 1, function(x) mean(x[1:(x[14]-1)]))
df$within_mean <- apply(df, 1, function(x) mean(x[x[14]:x[15]]))
df$after_mean <- apply(df, 1, function(x) mean(x[(x[15]+1):11]))
df$s <- NULL
df$e <- NULL
This solution is tied to the exact years as in the example, but it would not be too hard to make it more generic.
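One possible way to make it more generic, as a sketch only: read the years from the column names instead of hard-coding the 1999 offset and the column positions. The names year_cols, years and means are illustrative, and the year columns are assumed to be named "y" followed by digits.

```r
df <- data.frame(y2000=c(12,636), y2001=c(234,76), y2002=c(3434,46),
                 y2003=c(36,35), y2004=c(6,64), y2005=c(56,65),
                 y2006=c(43,65), y2007=c(6,56), y2008=c(64,66),
                 y2009=c(63,5656), y2010=c(65,54),
                 startyear=c(2006,2001), endyear=c(2009,2005))

# Assumption: year columns are exactly those named "y" + digits
year_cols <- grep("^y[0-9]+$", names(df))
years <- as.numeric(sub("^y", "", names(df)[year_cols]))

# apply() passes each row as a named numeric vector,
# so startyear and endyear can be looked up by name
means <- t(apply(df, 1, function(x) {
  vals <- x[year_cols]
  c(before_mean = mean(vals[years < x["startyear"]]),
    within_mean = mean(vals[years >= x["startyear"] & years <= x["endyear"]]),
    after_mean  = mean(vals[years > x["endyear"]]))
}))

df <- cbind(df, means)
```

This relies on every column of df being numeric, because apply() converts the data frame to a matrix first.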
Upvotes: 1
Reputation: 887851
1. dplyr/tidyr
It may be better to convert the 'wide' format to 'long' format. We could use dplyr/tidyr to get the mean. Create an 'ind' column, reshape the data to 'long' using gather, split the 'variable' column into two columns ('var1', 'var2') with extract, group by 'ind', and get the mean of the 'value' column after subsetting it with each logical index (i.e. var2 < startyear, var2 >= startyear & var2 <= endyear, and var2 > endyear).
library(dplyr)
library(tidyr)
dS <- df %>%
  mutate(ind=row_number()) %>%
  gather(variable, value, starts_with('y')) %>%
  extract(variable, c('var1', 'var2'), '([^0-9]+)([0-9]+)',
          convert=TRUE) %>%
  group_by(ind) %>%
  summarise(before_mean = mean(value[var2 < startyear]),
            within_mean = mean(value[var2 >= startyear &
                                     var2 <= endyear]),
            after_mean = mean(value[var2 > endyear])) %>%
  as.data.frame()
nm1 <- paste(c('before', 'within', 'after'), 'mean', sep="_")
dS
# ind before_mean within_mean after_mean
#1 1 629.6667 44.0 65.0
#2 2 636.0000 57.2 1179.4
We can create additional columns in 'df' from the above output
df[nm1] <- dS[nm1]  # select only the three mean columns, leaving out 'ind'
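With a recent tidyr (1.1.0 or later is assumed here), the reshape step could also be written with pivot_longer, which replaces the gather/extract pair. This is a sketch of the same idea rather than the original answer's code; the names res and year are chosen for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(y2000=c(12,636), y2001=c(234,76), y2002=c(3434,46),
                 y2003=c(36,35), y2004=c(6,64), y2005=c(56,65),
                 y2006=c(43,65), y2007=c(6,56), y2008=c(64,66),
                 y2009=c(63,5656), y2010=c(65,54),
                 startyear=c(2006,2001), endyear=c(2009,2005))

res <- df %>%
  mutate(ind = row_number()) %>%
  # Strip the "y" prefix and store the year as an integer column
  pivot_longer(starts_with("y"), names_to = "year", names_prefix = "y",
               names_transform = list(year = as.integer)) %>%
  group_by(ind) %>%
  summarise(before_mean = mean(value[year < startyear]),
            within_mean = mean(value[year >= startyear & year <= endyear]),
            after_mean  = mean(value[year > endyear])) %>%
  as.data.frame()
```

The result has the same four columns as 'dS' above ('ind' plus the three means).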
2. base R
We can use base R methods without changing the format of the dataset. From the original dataset ('df'), make an index ('indx') of the column names that contain digits, then remove the non-numeric part of those names and convert to numeric ('v1').
indx <- grep('\\d+', names(df))
v1 <- as.numeric(sub('[^0-9]+', '', names(df)[indx]))
Loop over the rows of 'df' (lapply), match the 'startyear' with 'v1', use that index ('i1') to get the columns before it, unlist, and calculate the mean ('before_mean'). The same can be done by matching the 'endyear' with 'v1' to get the index ('i2'). Based on 'i1' and 'i2', calculate the 'within_mean' and 'after_mean'. rbind the list elements and assign the output to new columns ('nm1') in 'df'.
df[nm1] <- do.call(rbind,lapply(1:nrow(df), function(i) {
i1 <- match(df$startyear[i], v1)
before_mean<- mean(unlist(df[i,1:(i1-1),drop=FALSE]))
i2 <- match(df$endyear[i], v1)
within_mean <- mean(unlist(df[i,i1:i2]))
after_mean <- mean(unlist(df[i,match(v1[(i2+1):length(v1)],v1)]))
data.frame(before_mean,within_mean, after_mean) }))
df[nm1]
# before_mean within_mean after_mean
#1 629.6667 44.0 65.0
#2 636.0000 57.2 1179.4
Upvotes: 3