Reputation: 1171
I want to calculate the mean for several columns and thus create a new column for the mean using dplyr
and without melting + merging.
> head(growth2)
CODE_COUNTRY CODE_PLOT IV12_ha_yr IV23_ha_yr IV34_ha_yr IV14_ha_yr IV24_ha_yr IV13_ha_yr
1 1 6 4.10 6.97 NA NA NA 4.58
2 1 17 9.88 8.75 NA NA NA 8.25
3 1 30 NA NA NA NA NA NA
4 1 37 15.43 15.07 11.89 10.00 12.09 14.33
5 1 41 20.21 15.01 14.72 11.31 13.27 17.09
6 1 46 12.64 14.36 13.65 9.07 12.47 12.36
>
I need a new column within the dataset with the mean of all the IV columns. I tried this:
growth2 %>%
group_by(CODE_COUNTRY, CODE_PLOT) %>%
summarise(IVmean=mean(IV12_ha_yr:IV13_ha_yr, na.rm=TRUE))
And returned several errors depending on the example used, such as:
Error in NA_real_:NA_real_ : NA/NaN argument
or
Error in if (trim > 0 && n) { : missing value where TRUE/FALSE needed
Upvotes: 39
Views: 74576
Reputation: 3690
Here is a tidyverse solution using c_across
which is designed for row-wise aggregations. With c_across
it's easy to refer to columns by name, type or position and to apply any function to the selected columns.
There is a trade-off, though, between generalizability and performance as this generic approach is less efficient than a built-in row-wise operation to calculate a specific summary such as rowMeans
. Here is more from the documentation on row-wise calculations:
The rowwise() approach will work for any summary function. But if you need greater speed, it’s worth looking for a built-in row-wise variant of your summary function. These are more efficient because they operate on the data frame as whole; they don’t split it into rows, compute the summary, and then join the results back together again.
Note: rowwise
is a grouping operation (ie. each row in its own group). Thanks to Matifou for highlighting this detail.
library("tidyverse")
df <-
tibble::tribble(
~CODE_COUNTRY, ~CODE_PLOT, ~IV12_ha_yr, ~IV23_ha_yr, ~IV34_ha_yr, ~IV14_ha_yr, ~IV24_ha_yr, ~IV13_ha_yr,
1L, 6L, 4.1, 6.97, NA, NA, NA, 4.58,
1L, 17L, 9.88, 8.75, NA, NA, NA, 8.25,
1L, 30L, NA, NA, NA, NA, NA, NA,
1L, 37L, 15.43, 15.07, 11.89, 10, 12.09, 14.33,
1L, 41L, 20.21, 15.01, 14.72, 11.31, 13.27, 17.09,
1L, 46L, 12.64, 14.36, 13.65, 9.07, 12.47, 12.36
)
df %>%
rowwise() %>%
mutate(
mean = mean(c_across(starts_with("IV")), na.rm = TRUE),
sd = sd(c_across(starts_with("IV")), na.rm = TRUE)
) %>%
ungroup()
#> # A tibble: 6 × 10
#> CODE_COUNTRY CODE_PLOT IV12_ha_yr IV23_ha_yr IV34_ha_yr IV14_ha_yr IV24_ha_yr
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 4.1 6.97 NA NA NA
#> 2 1 17 9.88 8.75 NA NA NA
#> 3 1 30 NA NA NA NA NA
#> 4 1 37 15.4 15.1 11.9 10 12.1
#> 5 1 41 20.2 15.0 14.7 11.3 13.3
#> 6 1 46 12.6 14.4 13.6 9.07 12.5
#> IV13_ha_yr mean sd
#> <dbl> <dbl> <dbl>
#> 1 4.58 5.22 1.54
#> 2 8.25 8.96 0.835
#> 3 NA NaN NA
#> 4 14.3 13.1 2.14
#> 5 17.1 15.3 3.09
#> 6 12.4 12.4 1.82
Created on 2023-04-17 with reprex v2.0.2
Upvotes: 19
Reputation: 91
I tried to comment on Rick Scriven's answer but don't have the experience points for it. Anyway, wanted to contribute. His answer said to do this:
library(dplyr)
mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))
That works, but if all columns don't start with "IV", which was my case, how do you do it? Turns out, that select does not want a logical vector, so you can't use AND or OR. For example, you cannot say "starts_with('X') | starts_with('Y')". You have to build a numeric vector. Here is how it is done.
mutate(df, IVMean = rowMeans(select(df, c(starts_with("IV"), starts_with("IX"))), na.rm = TRUE))
Upvotes: 4
Reputation: 2381
Use .
in dplyr.
library(dplyr)
mutate(df, IVMean = rowMeans(select(., starts_with("IV")), na.rm = TRUE))
Upvotes: 9
Reputation: 99331
You don't need to group, just select()
and then mutate()
library(dplyr)
mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))
Upvotes: 47
Reputation:
you can use as follows:
your data
data<- structure(list(CODE_COUNTRY = c(1L, 1L, 1L, 1L, 1L, 1L), CODE_PLOT = c(6L,
17L, 30L, 37L, 41L, 46L), IV12_ha_yr = c(4.1, 9.88, NA, 15.43,
20.21, 12.64), IV23_ha_yr = c(6.97, 8.75, NA, 15.07, 15.01, 14.36
), IV34_ha_yr = c(NA, NA, NA, 11.89, 14.72, 13.65), IV14_ha_yr = c(NA,
NA, NA, 10, 11.31, 9.07), IV24_ha_yr = c(NA, NA, NA, 12.09, 13.27,
12.47), IV13_ha_yr = c(4.58, 8.25, NA, 14.33, 17.09, 12.36)), .Names = c("CODE_COUNTRY",
"CODE_PLOT", "IV12_ha_yr", "IV23_ha_yr", "IV34_ha_yr", "IV14_ha_yr",
"IV24_ha_yr", "IV13_ha_yr"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
mydata <- cbind(data,IVMean=apply(data[,3:8],1,mean, na.rm=TRUE))
you can also do this
mydata <- cbind(data,IVMean=rowMeans(data[3:8], na.rm=TRUE))
Upvotes: 0