Thomas
Thomas

Reputation: 2534

Creating new dataframe using weighted averages from dataframes within list

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:

dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"), 
                      df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")), 
                                      .Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")), 
                 .Names = c("df1", "df2"))

In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:

result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"), 
    A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L, 
    1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))


    Site    A.Weight    B.Weight    Sum.Weight
1   X       4.5         6           2
2   Y       8.0         4           1

The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.

I'm able to manually calculate weighted averages using something like

weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)

but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.

Upvotes: 3

Views: 250

Answers (2)

akrun
akrun

Reputation: 887971

using dplyr

 library(dplyr)
 library('devtools')
 install_github('hadley/tidyr')
 library(tidyr)

 unnest(dfs) %>%
           group_by(Site) %>% 
           filter(Weight) %>% 
           mutate(Sum=n()) %>%
           select(-Weight) %>% 
           summarise_each(funs(mean=mean(., na.rm=TRUE)))

gives the result

 #  Site   A B Sum
 #1    X 4.5 6   2
 #2    Y 8.0 4   1

Or using data.table

 library(data.table)
 DT <- rbindlist(dfs)
 DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE), 
                Sum=.N), by = Site, .SDcols = c("A", "B")]
 #   Site   A B Sum
 #1:    X 4.5 6   2
 #2:    Y 8.0 4   1

Update

In response to @jazzuro's comment, Using dplyr 0.3, I am getting

   unnest(dfs) %>% 
             group_by(Site) %>% 
             summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
                    Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
             select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight) 

  #    Site A_weighted.mean B_weighted.mean Sum.Weight
  #1    X             4.5               6          2
  #2    Y             8.0               4          1

Upvotes: 2

Chase
Chase

Reputation: 69251

The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:

foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
  means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
  sumWeight <- sum(data[, weightCol])
  others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
  out <- data.frame(others, means, sumWeight)
  return(out)
}

In action:

do.call(rbind, lapply(dfs, foo))
---
    Site   A B sumWeight
df1    X 4.5 6         2
df2    Y 8.0 4         1

Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:

do.call(rbind, lapply(dfs, foo, 
                      meanCols = grepl("A|B", names(dfs[[1]])),
                      otherCols = grepl("Site", names(dfs[[1]]))
                      ))

Upvotes: 2

Related Questions