Reputation: 2534
I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A
, B
, and Weight
for the weighted averages. I also want to move over related data such as Site
, and want to sum the number of TRUE
and FALSE
. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A
and B
for which I want to calculate weighted averages. I also have several columns similar to Site
that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply
across dataframes in a list, but my attempts have not been so great so far.
Upvotes: 3
Views: 250
Reputation: 887971
using dplyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
In response to @jazzuro's comment, Using dplyr 0.3
, I am getting
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1
Upvotes: 2
Reputation: 69251
The trick is to create a function that works for a single data.frame, then use lapply
to iterate across your list. Since lapply
returns a list, we'll then use do.call
to rbind
the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
sumWeight <- sum(data[, weightCol])
others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
out <- data.frame(others, means, sumWeight)
return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
---
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl()
and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
Upvotes: 2