Reputation: 796
I can't find the answer anywhere.
I would like to calculate new variable of data frame which is based on mean of rows.
For example:
data <- data.frame(id=c(101,102,103), a=c(1,2,3), b=c(2,2,2), c=c(3,3,3))
I want to use mutate to make variable d which is mean of a,b and c. And I would like to be able to make that by selecting columns in way d=mean(a,b,c), and also I need to use range of variables (like in dplyr) d=mean(a:c).
And of course
mutate(data, c=mean(a,b))
or
mutate(data, c=rowMeans(a,b))
doesn't work.
Can you give me some tip?
Regards
Upvotes: 47
Views: 54812
Reputation: 1280
You can use a wrapper function around rowMeans()
to make it easier to work with. The one below lets you specify na.rm
, and you can use tidyselect to choose your columns if you want.
# This is the wrapper function
means <- function(..., na.rm = FALSE) {
rowMeans(data.frame(...), na.rm = na.rm)
}
library(dplyr)
# Example data
iris2 <- iris %>%
head() %>%
transmute(Sepal.Length = replace(Sepal.Length,
sample(c(TRUE, FALSE), nrow(.),
replace = TRUE),
NA),
Sepal.Width,
Petal.Length,
Petal.Width) %>%
print()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 NA 3.5 1.4 0.2
#> 2 NA 3.0 1.4 0.2
#> 3 NA 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 NA 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
# Basic usage
iris2 %>%
mutate(mean_sepal = means(Sepal.Length, Sepal.Width))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 NA
#> 2 NA 3.0 1.4 0.2 NA
#> 3 NA 3.2 1.3 0.2 NA
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 NA
#> 6 5.4 3.9 1.7 0.4 4.65
# If you want to exclude NAs
iris2 %>%
mutate(mean_sepal = means(Sepal.Length, Sepal.Width, na.rm = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 3.50
#> 2 NA 3.0 1.4 0.2 3.00
#> 3 NA 3.2 1.3 0.2 3.20
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 3.60
#> 6 5.4 3.9 1.7 0.4 4.65
# You can also use select() and choose columns using tidyselect
iris2 %>%
mutate(mean_sepal = means(select(., contains("Sepal")), na.rm = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 3.50
#> 2 NA 3.0 1.4 0.2 3.00
#> 3 NA 3.2 1.3 0.2 3.20
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 3.60
#> 6 5.4 3.9 1.7 0.4 4.65
Created on 2022-01-13 by the reprex package (v2.0.1)
Upvotes: 0
Reputation: 1118
If you'd like to use a pivot_longer()
-style solution:
data%>%
pivot_longer(cols=-id)%>%
group_by(id)%>%
mutate(mean=mean(value))%>%
pivot_wider(names_from=name, values_from=value)
Note that this requires the tidyr
package.
This is my preference for the fact that I only need to type the name of my ID column, and don't have to worry about column indices or names otherwise. Good for a quick copy-and-point-this-at-different-data solution, though the same can be said of other answers here. Also good for cases where you might have more than one column with categorical information and haven't created a single unique identifier column.
For what it's worth, I found that this solution is very easily modified to ignore NA values with simple addition of na.rm=TRUE
in the mean calculation.
For example:
data <- data.frame(id=c(101,102,103), a=c(NA,2,3), b=c(2,2,2), c=c(3,3,3))
data%>%
pivot_longer(cols=-id)%>%
group_by(id)%>%
mutate(mean=mean(value,na.rm=TRUE))%>%
pivot_wider(names_from = name, values_from=value)
Upvotes: 0
Reputation: 2017
I think the answer suggesting using data.frame
or slicing on .
is the best, but could be made simpler and more dplyr-ish like so:
data %>% mutate(c = rowMeans(select(., a,b)))
Or if you want to avoid .
, with the penalty of having two inputs to your pipeline:
data %>% mutate(c = rowMeans(select(data, a,b)))
Upvotes: 12
Reputation: 3883
And yet another couple of ways, useful if you have the numeric positions or vector names of the columns to be summarised:
data %>% mutate(d = rowMeans(.[, 2:4]))
or
data %>% mutate(d = rowMeans(.[, c("a","b","c")]))
Upvotes: 12
Reputation: 420
Another simple possibility with few code is:
data %>%
mutate(c= rowMeans(data.frame(a,b)))
# id a b c
# 1 101 1 2 1.5
# 2 102 2 2 2.0
# 3 103 3 2 2.5
As rowMeans needs something like a matrix or a data.frame, you can use data.frame(var1, var2, ...)
instead of c(var1, var2, ...)
. If you have NAs in your data you'll need to tell R what to do, for example to remove them: rowMeans(data.frame(a,b), na.rm=TRUE)
Upvotes: 5
Reputation: 44614
You're looking for
data %>%
rowwise() %>%
mutate(c=mean(c(a,b)))
# id a b c
# (dbl) (dbl) (dbl) (dbl)
# 1 101 1 2 1.5
# 2 102 2 2 2.0
# 3 103 3 2 2.5
or
library(purrr)
data %>%
rowwise() %>%
mutate(c=lift_vd(mean)(a,b))
Upvotes: 49
Reputation: 66819
I think this is the dplyr-ish way. First, I'd create a function:
my_rowmeans = function(...) Reduce(`+`, list(...))/length(list(...))
Then, it can be used inside mutate:
data %>% mutate(rms = my_rowmeans(a, b))
# id a b c rms
# 1 101 1 2 3 1.5
# 2 102 2 2 3 2.0
# 3 103 3 2 3 2.5
# or
data %>% mutate(rms = my_rowmeans(a, b, c))
# id a b c rms
# 1 101 1 2 3 2.000000
# 2 102 2 2 3 2.333333
# 3 103 3 2 3 2.666667
To deal with the possibility of NAs
, the function must be uglified:
my_rowmeans = function(..., na.rm=TRUE){
x =
if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
else list(...)
d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)
Reduce(`+`, x)/d
}
# alternately...
my_rowmeans2 = function(..., na.rm=TRUE) rowMeans(cbind(...), na.rm=na.rm)
# new example
data$b[2] <- NA
data %>% mutate(rms = my_rowmeans(a,b,na.rm=FALSE))
id a b c rms
1 101 1 2 3 1.5
2 102 2 NA 3 NA
3 103 3 2 3 2.5
data %>% mutate(rms = my_rowmeans(a,b))
id a b c rms
1 101 1 2 3 1.5
2 102 2 NA 3 2.0
3 103 3 2 3 2.5
The downside to the my_rowmeans2
is that it coerces to a matrix. I'm not certain that this will always be slower than the Reduce
approach, though.
Upvotes: 5
Reputation: 545588
dplyr is badly suited to operate on this kind of data because it assumes tidy data format and — for the problem in question — your data is untidy.
You can of course tidy it first:
tidy_data = tidyr::gather(data, name, value, -id)
Which looks like this:
id name value
1 101 a 1
2 102 a 2
3 103 a 3
4 101 b 2
5 102 b 2
6 103 b 2
…
And then:
tidy_data %>% group_by(id) %>% summarize(mean = mean(value))
name mean
(fctr) (dbl)
1 a 2
2 b 2
3 c 3
Of course this discards the original data. You could use mutate
instead of summarize
to avoid this. Finally, you can then un-tidy your data again:
tidy_data %>%
group_by(id) %>%
mutate(mean = mean(value)) %>%
tidyr::spread(name, value)
id mean a b c
(dbl) (dbl) (dbl) (dbl) (dbl)
1 101 2.000000 1 2 3
2 102 2.333333 2 2 3
3 103 2.666667 3 2 3
Alternatively, you could summarise and then merge the result with the original table:
tidy_data %>%
group_by(id) %>%
summarize(mean = mean(value)) %>%
inner_join(data, by = 'id')
The result is the same in either case. I conceptually prefer the second variant.
Upvotes: 20