user2325155
user2325155

Reputation: 147

Function to take averages between dataframes in R

I really don't know where to start with this one, so I'm asking here. I have two data frames:

set.seed(21)
DF1 <- data.frame(year = c(seq(2000,2012,by=1)), 
              C1 = runif(13,0,1),
              C2 = runif(13,0,1),
              C3 = runif(13,0,1),
              C4 = runif(13,0,1),
              C5 = runif(13,0,1))

DF2 <- data.frame(column = c("C1", "C2", "C3", "C4", "C5"),
              start = c(2005,2001,2006,2005,2009),
              end = c(2012,2009,2011,2010,2012))

I need to write a function that has the following steps:

  1. For each row in DF2: Take the corresponding column in DF2$column and take a mean from DF1.

    For example: in DF1$C1, take a mean of values between 2005 and 2012

  2. Report: DF2[1,1], DF2[1,2], DF2[1,3], mean 1

Values that are less than the available data, for example: 2002 - 5 = 1997 but are not available in DF1 would be OK as NA.

An example output:

    > DF2.out
      column start  end        m1 
    1     C1  2005 2012 0.9186834 
    2     C2  2001 2009        NA 

Thank you for your help in advance!

Upvotes: 0

Views: 79

Answers (4)

Bulat
Bulat

Reputation: 6969

You can use mapply to wrap the loop over DF2 rows:

library(data.table) # using for convenience 
DT <- data.table(DF1)
res <- mapply(function(c, start, end) {
         res <- DT[year >= start & year <= end, mean(get(c))]
         return (res)
      } , as.character(DF2$column), DF2$start, DF2$end)
res <- data.frame(res)
res$column <- rownames(res)
res <- merge(DF2, res)
res 

#  column start  end       res
#1     C1  2005 2012 0.5861268
#2     C2  2001 2009 0.3942018
#3     C3  2006 2011 0.5853924
#4     C4  2005 2010 0.4904493
#5     C5  2009 2012 0.6783216

Upvotes: 1

thelatemail
thelatemail

Reputation: 93813

Another attempt using mapply that should be pretty quick as it is just a bit of matrix indexing and selecting:

column <- match(DF2$column, names(DF1) )
start  <- match(DF2$start, DF1$year)
end    <- match(DF2$end, DF1$year)

m1 <- mapply(
  function(r1,r2,co) mean(DF1[cbind(seq(r1,r2), co)]),
  start,
  end,
  column 
)

data.frame(
  column=names(DF1)[column], 
  start=DF1$year[start],
  end=DF1$year[end],
  m1
)

#  column start  end        m1
#1     C1  2005 2012 0.5861268
#2     C2  2001 2009 0.3942018
#3     C3  2006 2011 0.5853924
#4     C4  2005 2010 0.4904493
#5     C5  2009 2012 0.6783216

Upvotes: 0

Pete Barwis
Pete Barwis

Reputation: 114

If I've interpreted your question correctly, the following example should give you what you want, if what you want is the mean for each column in DF1 after each column has been subset to the year range in DF2:

# get the column names from DF2$column
c_list <- as.character(DF2$column)

# for each column name in c_list, store the start and end
# year, and find the mean of the column subset by year range
ml <- do.call(rbind, lapply(1:length(c_list), function(x){

  start <- DF2[x, "start"]
    end <- DF2[x, "end"]

  mean(DF1[DF1$year >= start & DF1$year <= end,  c_list[x]])

}))

# join the means with DF2
DF2.out <- cbind(DF2, ml)

> DF2.out
  column start  end        ml
1     C1  2005 2012 0.5861268
2     C2  2001 2009 0.3942018
3     C3  2006 2011 0.5853924
4     C4  2005 2010 0.4904493
5     C5  2009 2012 0.6783216

Upvotes: 0

pbahr
pbahr

Reputation: 1340

I'm assuming your question is about summarizing one data frame by parameters you have in another data frame. In this case, the code below will help with part 1.

library(dplyr)

apply.by.colname <- function(data, col.name, year.start, year.end) {

    data %>% 
        filter(year >= year.start & year <= year.end) %>% 
        select(matches(col.name))
}

new.df <- apply.by.colname(DF1, "C1", 2005, 2012)
sapply(new.df, mean)

For the complete solution, you may need to use this function in other custom functions or apply calls.

Upvotes: 1

Related Questions