ccamara

Reputation: 1225

Calculate the mean (and weighted mean) of several columns within a dataframe

Summary

Given a dataframe in which several columns are variables (all numeric except one, which is a factor) and rows are observations, I would like to create a new column with the mean of all numeric columns, plus another with a weighted mean of all numeric columns.

I have found quite a few approaches that apparently solve this problem (using dplyr, lapply, data.table...), but none of them work with wide dataframes, and I am not sure I can convert mine to long format (see below). Please be patient before marking this as a duplicate, as I haven't found any answer to my problem.

Long version:

I have a dataframe in wide format like the one provided below (the original one has more than 1700 observations of 20 variables grouped into 30 neighbourhoods) that is the result of calculating the median of the values of each variable:

df = data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour = c("neighbour1", "neighbour2", "neighbour3")

df
> df
          X1         X2         X3         X4        X5  neighbour
1  1.0384405  0.6116994 -0.2075835  0.3206011 1.3855455 neighbour1
2 -0.5115649 -0.7722500  0.8374265 -1.3697758 0.1690452 neighbour2
3  1.0145282  0.6809156 -0.2918737  0.2912297 1.0689213 neighbour3

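I would like to create two new columns: one with the mean of all numeric columns in each row, and another with their weighted mean.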

My first attempt used dplyr::mutate to create those columns, but I haven't succeeded, most likely because I'm doing it wrong (and if I can't get a regular mean to work, I have no idea how to compute a weighted one):

df = df %>%
  mutate(mean = mean(select(-neighbour)))
Error in mutate_impl(.data, dots) : 
  invalid argument to unary operator
> df = df %>%
+   mutate(mean = mean())
Error in mutate_impl(.data, dots) : 
  argument "x" is missing, with no default
> df = df %>%
+   mutate(mean = mean(is.numeric()))
Error in mutate_impl(.data, dots) : 
  0 arguments passed to 'is.numeric' which requires 1
> 

I also tried mutate_each, but I assume my problem is that I do not know how to pass the right columns to calculate the mean (not to mention that I have no idea how to do the weighted mean).

From what I have read there are many ways to create the desired columns:

Can anyone shed some light on this? I am so frustrated trying to solve it that I can't see the answer.

EDIT: As per @boshek's answer, I have tried converting from wide to long format and then applying summarise_each, but that hasn't worked either:

df = df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour, variable) %>%
  summarise_each(., funs=mean)

Upvotes: 0

Views: 7381

Answers (4)

I know I'm a bit late posting this, but I was looking for a solution to a similar problem and found rowWeightedMeans from the matrixStats library, which also supports na.rm. You only need to convert the columns to a matrix. It works as follows:

library(matrixStats)
# weight: numeric vector with one weight per selected column (define it beforehand)
df$wmean <- rowWeightedMeans(as.matrix(df[ , c('X1', 'X2', 'X3', 'X4', 'X5')]), w = weight)

This worked perfectly for me and, as mentioned, has the added benefit of supporting na.rm = TRUE, which I needed.
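If installing matrixStats is not an option, roughly the same result can be sketched in base R. The data frame and the `weight` vector below are made up for illustration (the question never defines its weights). The sketch relies on `weighted.mean()` with `na.rm = TRUE`, which drops each missing value together with its weight.

```r
# Toy data with one missing value; values and weights are illustrative only.
df <- data.frame(
  X1 = c(1, 2, 3),
  X2 = c(2, NA, 6),
  X3 = c(3, 6, 9),
  neighbour = c("neighbour1", "neighbour2", "neighbour3")
)
weight <- c(1, 2, 1)  # hypothetical: one weight per X column, in column order

x_cols <- c("X1", "X2", "X3")

# For each row, weighted.mean() removes NA entries and their weights
# before computing sum(x * w) / sum(w).
df$wmean <- apply(
  as.matrix(df[, x_cols]), 1,
  function(row) weighted.mean(row, w = weight, na.rm = TRUE)
)
```

In the second row the NA in X2 is skipped along with its weight, so only X1 and X3 contribute to that row's result.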

Upvotes: 0

boshek

Reputation: 4416

Ok - so you want means ACROSS the row?

I'd use gather (from tidyr, used alongside dplyr), then merge the result back with your original data:

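library(dplyr)
library(tidyr)

# weight: numeric vector with one weight per X column, in column order (define it beforehand)
df.mean <- df %>%
  gather(variable, value, -neighbour) %>%
  group_by(neighbour) %>%
  summarise(mean_value=mean(value), wmean_value=weighted.mean(value, weight))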

df.comb <- df %>%
  full_join(.,df.mean, by=c("neighbour"))

There are a few ways to skin this cat but this is one.

Is this what you wanted?

Upvotes: 2

joel.wilson

Reputation: 8413

# weight: numeric vector of length 5, one weight per column (define it beforehand)
df$mean <- apply(df[1:5], 1, mean)
df$wt.mean <- apply(df[1:5], 1, weighted.mean, w = weight)
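Since `weight` is never defined in the question, here is a self-contained sketch of the same approach. The seed and the weight values are arbitrary assumptions, chosen only so that the example runs end to end.

```r
set.seed(42)  # arbitrary seed so the example is reproducible
df <- data.frame(matrix(rnorm(15), nrow = 3))
df$neighbour <- c("neighbour1", "neighbour2", "neighbour3")

# Hypothetical weights, one per numeric column X1..X5.
weight <- c(0.1, 0.2, 0.3, 0.2, 0.2)

# apply() converts the five numeric columns to a matrix and
# calls the function once per row (MARGIN = 1).
df$mean <- apply(df[1:5], 1, mean)
df$wt.mean <- apply(df[1:5], 1, weighted.mean, w = weight)

# Sanity check: with equal weights the weighted mean equals the plain mean.
equal_w <- apply(df[1:5], 1, weighted.mean, w = rep(1, 5))
same_as_mean <- isTRUE(all.equal(unname(equal_w), unname(df$mean)))
```

Selecting `df[1:5]` by position only works while the numeric columns come first, so selecting by name or by type is safer on a real dataset.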

Upvotes: 3

Rose Hartman

Reputation: 487

I think the rowMeans() function in base may be your best bet.

df$mean <- rowMeans(dplyr::select(df, starts_with("X")))
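If the numeric columns do not share a common prefix, they can be picked by type instead. This is a minimal base-R sketch on made-up data, with the factor column excluded automatically:

```r
# Illustrative data: three numeric columns and one factor column.
df <- data.frame(
  X1 = c(1, 4),
  X2 = c(2, 5),
  X3 = c(3, 6),
  neighbour = factor(c("neighbour1", "neighbour2"))
)

# Logical vector marking the numeric columns; the factor column is FALSE.
num_cols <- vapply(df, is.numeric, logical(1))

# rowMeans() accepts a data frame as long as all its columns are numeric.
df$mean <- rowMeans(df[, num_cols])
```

Because `num_cols` is computed before `mean` is added, the new column does not feed back into the calculation.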

The weighted mean may be more difficult. I couldn't find a quick and clean way to do it, but here's an option that works:

# define a function that calculates a weighted mean
wmean <- function(x, weight){
  stopifnot(length(x) == length(weight))
  if(!isTRUE(all.equal(sum(weight), 1))) {  # tolerant check; != can misfire on floating-point sums
    message("Rescaling weights to sum to 1")
    weight <- weight/sum(weight)
  }
  wx <- sum(x * weight)
  return(wx)
}
# apply that function row by row to the X columns in df
df$wmean <- apply(X=dplyr::select(df, starts_with("X")), MARGIN=1, FUN=wmean, weight = weight)
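The row-by-row loop can also be replaced by a single matrix product, which tends to be faster on wide data. The sketch below uses made-up data and a hypothetical `weight` vector, and it assumes there are no missing values, since an NA in a row would make that row's result NA.

```r
# Illustrative data and weights; neither comes from the question.
df <- data.frame(
  X1 = c(1, 2),
  X2 = c(3, 4),
  X3 = c(5, 6),
  neighbour = c("neighbour1", "neighbour2")
)
weight <- c(2, 1, 1)  # hypothetical weights, one per X column

# Select the X columns by name prefix using base R only.
x <- as.matrix(df[, startsWith(names(df), "X")])

# Normalise the weights, then take the row-wise dot product in one step.
df$wmean <- as.vector(x %*% (weight / sum(weight)))
```

For the first row this gives (1*2 + 3*1 + 5*1) / 4 = 2.5, which matches what `weighted.mean()` returns for the same inputs.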

Upvotes: 1
