Érico Patto

Reputation: 1015

How to take mean of values of select columns by rows

My tibble is included at the end.

I have a data frame with several groups of columns (the columns within each group are repeated measurements). The first four columns should be left as they are. The columns whose names start with "canopy" should be summarised into a single mean (I mention starts_with() because that function may be useful), and likewise the "understory" columns (named "under") and the "tree diameter-at-breast-height" columns (tdbh; the columns t1dbh to t10dbh), all with na.rm = TRUE. How can I summarise the columns like this?

Data (head):

structure(list(Site = c("Bala", "Bala", "Bala", "Bala", "Bala", 
"Bala"), Classification = c("Primary forest", "Primary forest", 
"Primary forest", "Primary forest", "Primary forest", "Primary forest"
), transect = c(1, 1, 1, 1, 1, 2), point = c(1, 2, 3, 4, 5, 1
), canopy1 = c(10, 2, 5, 10, 2, 4), canopy2 = c(4, 3, 2, 3, 2, 
6), canopy3 = c(5, 2, 3, 4, 5, 1), canopy4 = c(5, 3, 2, 2, 2, 
2), canopy5 = c(3, 9, 6, 7, 4, 8), under1 = c(15, 17, 4, 23, 
9, 27), under2 = c(13, 1, 0, 0, 0, 0), under3 = c(1, 4, 4, 4, 
5, 23), under4 = c(19, 9, 0, 5, 0, 0), Numtrees = c(4, 3, 6, 
1, 3, 7), t1dbh = c(110, 100, 50, 60, 100, 70), t2dbh = c(80, 
60, 60, NA, 70, 80), t3dbh = c(80, 110, 70, NA, 100, 50), t4dbh = c(90, 
NA, 110, NA, NA, 60), t5dbh = c(NA, NA, 90, NA, NA, 50), t6dbh = c(NA, 
NA, 110, NA, NA, 60), t7dbh = c(NA, NA, NA, NA, NA, 250), t8dbh = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), t9dbh = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), t10dbh = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Desired output starts with:

# A tibble: 2 x 7
  Site  Classification transect point canopy understory tdbh 
  <chr> <chr>             <dbl> <dbl>  <dbl> <chr>      <chr>
1 Bala  Primary forest        1     1    5.4 ...        ...  
2 Bala  Primary forest        1     2    3.8 ...        ...  

I was hoping this would be possible using only base R and anything in the tidyverse (probably dplyr and/or tidyr)...

EDIT: I know mutate(canopy = mean(c(canopy1, canopy2, ...))) should work fine, but it has two problems. First, it adds a column instead of replacing the existing ones. That is a nuisance, but not really terrible. Second, and worse, I'd have to list every column by hand, which is the mark of an inefficient answer.

Upvotes: 1

Views: 46

Answers (3)

www

Reputation: 39154

A solution using the tidyverse package. We can create a vector with the target strings, and then use map_dfc and mutate to dynamically calculate the means. After that, we can combine the calculated columns with the remaining columns of the original data frame.

library(tidyverse)

# Set the target column names
target <- c("canopy", "under", "dbh")

# Compute the row means for each target and bind the results column-wise
dat2 <- map_dfc(target, function(x){
  temp <- dat %>%
    mutate("{x}" := rowMeans(select(., contains(x)), na.rm = TRUE), .keep = "none")
})

dat3 <- dat %>% 
  select(-contains(target)) %>%
  bind_cols(dat2)

print(dat3)
# # A tibble: 6 x 8
#   Site  Classification transect point Numtrees canopy under   dbh
#   <chr> <chr>             <dbl> <dbl>    <dbl>  <dbl> <dbl> <dbl>
# 1 Bala  Primary forest        1     1        4    5.4 12     90  
# 2 Bala  Primary forest        1     2        3    3.8  7.75  90  
# 3 Bala  Primary forest        1     3        6    3.6  2     81.7
# 4 Bala  Primary forest        1     4        1    5.2  8     60  
# 5 Bala  Primary forest        1     5        3    3    3.5   90  
# 6 Bala  Primary forest        2     1        7    4.2 12.5   88.6

Upvotes: 0

jay.sf

Reputation: 73272

Using sapply.

cbind(df[1:4], sapply(c("canopy", "under", "dbh"), function(x) 
  rowMeans(df[grep(x, names(df))], na.rm=TRUE)))
#   Site Classification transect point canopy under      dbh
# 1 Bala Primary forest        1     1    5.4 12.00 90.00000
# 2 Bala Primary forest        1     2    3.8  7.75 90.00000
# 3 Bala Primary forest        1     3    3.6  2.00 81.66667
# 4 Bala Primary forest        1     4    5.2  8.00 60.00000
# 5 Bala Primary forest        1     5    3.0  3.50 90.00000
# 6 Bala Primary forest        2     1    4.2 12.50 88.57143

Upvotes: 1

Allan Cameron

Reputation: 174278

Can't you just do:

df$canopy <- rowMeans(df[grep("^canopy", names(df))])

Or, in tidyverse-speak (and removing intervening columns with select to show the results):

 df %>% 
   mutate(canopy = rowMeans(select(., starts_with("canopy")))) %>%
   select(-(5:24))

#> # A tibble: 6 x 5
#>   Site  Classification transect point canopy
#>   <chr> <chr>             <dbl> <dbl>  <dbl>
#> 1 Bala  Primary forest        1     1    5.4
#> 2 Bala  Primary forest        1     2    3.8
#> 3 Bala  Primary forest        1     3    3.6
#> 4 Bala  Primary forest        1     4    5.2
#> 5 Bala  Primary forest        1     5    3  
#> 6 Bala  Primary forest        2     1    4.2
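
The same idea can probably be extended to all three groups at once, with na.rm = TRUE and with the source columns dropped. The following is only a sketch in base R: the `patterns` lookup and the cut-down `df` are made up for illustration (the values are copied from the first two rows of the question's data, keeping just a few of the repeated columns), and the regular expressions assume the column naming shown in the question.

```r
# Cut-down stand-in for the question's data (illustrative subset only)
df <- data.frame(
  Site    = c("Bala", "Bala"),
  canopy1 = c(10, 2),    canopy2 = c(4, 3),
  under1  = c(15, 17),   under2  = c(13, 1),
  t1dbh   = c(110, 100), t2dbh   = c(80, 60), t4dbh = c(90, NA)
)

# Hypothetical lookup: new column name -> regex matching the columns to average
patterns <- c(canopy     = "^canopy",
              understory = "^under",
              tdbh       = "^t[0-9]+dbh$")

for (nm in names(patterns)) {
  # Names of the repeated-measurement columns in this group
  cols  <- grep(patterns[[nm]], names(df), value = TRUE)
  # Row-wise mean, ignoring missing values
  means <- rowMeans(df[cols], na.rm = TRUE)
  # Drop the source columns, then add the summary column at the end
  df[cols] <- NULL
  df[[nm]] <- means
}

df
```

Note that a row in which every value of a group is NA comes out as NaN rather than NA, and that any column not matched by one of the patterns (Numtrees in the full data) is left untouched.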

Upvotes: 1
