Dalmuti71
Dalmuti71

Reputation: 1629

Calculating row means without having to provide column names and selectively removing columns based on each columns sum

The following data set is given:

library(tidyverse)
# example data
df1 = data.frame(ID = c("daisy", "lily", "rose", "tulip", "poppy", "iris", "orchid", "lotus", "crocus"), 
                 loc1 = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
                 loc2 = c(100, 200, 300, 400, 500, 600, 700, 800, 900), 
                 loc3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0), 
                 loc4 = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000))

Problem 1: For each row, extract minimum value, calculate mean, and append both results to data set. Works with the below code:

df1 %>%  
  rowwise() %>% 
  mutate(Min = min(c(loc1, loc2, loc3, loc4)), Mean = mean(c(loc1, loc2, loc3, loc4)))

How can I make the code more generic so it can be applied to all columns in a data set that, e.g. do not contain factors or strings? I want to avoid having to type in column names when I have 100+ columns. I tried the following:

df1 %>%  
  rowwise() %>% 
  mutate(Min =  min(is_double(df1)), Mean = mean(is_double(df1)))

but it does not yield the desired result:

ID      loc1  loc2  loc3  loc4 Median  Mean
  <fct>  <dbl> <dbl> <dbl> <dbl> <lgl>  <dbl>
1 daisy     10   100     0  1000 FALSE      0
2 lily      20   200     0  2000 FALSE      0
3 rose      30   300     0  3000 FALSE      0

How can this be solved?

Problem 2: How can I calculate the sum of the values in each column, then remove columns where sum = 0, while saving the names of the columns that were removed in a vector or data frame (so I can save the names to a file)?

I tried

cs <- colSums(df1[,2:ncol(df1)])
df1 %>% 
  select(which(cs > 0))

but get a very strange result where column names are shifted and the column containing only '0' is retained.

loc1 loc2 loc4
1  daisy   10    0
2   lily   20    0
3   rose   30    0
4  tulip   40    0

Any suggestions?

Many thanks!

Upvotes: 2

Views: 367

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389135

Answer to Problem 1 :

We can use pmap_dbl to apply the function per row and use select_if to select columns which are not factor or character.

library(dplyr)
library(purrr)

df1 %>%
  mutate(Min = pmap_dbl(select_if(., ~!(is.factor(.) | is.character(.))), min),
         Mean = pmap_dbl(select_if(., ~!(is.factor(.) | is.character(.))), 
                                      ~mean(c(...))))

#      ID loc1 loc2 loc3 loc4 Min   Mean
#1  daisy   10  100    0 1000   0  277.5
#2   lily   20  200    0 2000   0  555.0
#3   rose   30  300    0 3000   0  832.5
#4  tulip   40  400    0 4000   0 1110.0
#5  poppy   50  500    0 5000   0 1387.5
#6   iris   60  600    0 6000   0 1665.0
#7 orchid   70  700    0 7000   0 1942.5
#8  lotus   80  800    0 8000   0 2220.0
#9 crocus   90  900    0 9000   0 2497.5

Answer to Problem 2 :

We can use summarise_if to sum all numeric columns, select the columns which has sum as 0 and save their name in removed_cols.

removed_cols <- df1 %>%
                 summarise_if(is.numeric, sum) %>%
                 select_if(~. == 0) %>%
                 names
removed_cols
#[1] "loc3"

df1 %>%
  summarise_if(is.numeric, sum) %>%
  select_if(~. != 0) 

#  loc1 loc2  loc4
#1  450 4500 45000

Upvotes: 1

akrun
akrun

Reputation: 887501

With select_if we can select the numeric columns

library(dplyr)
library(matrixStats)
df1 %>%
    mutate(Median = select_if(., is.numeric) %>% 
                               as.matrix %>% 
                              rowMedians, 
           Mean =select_if(., is.numeric) %>% 
                        rowMeans )

Or convert to 'long' format and then do the group by row

library(dplyr)
library(tidyr)
df1 %>% 
   select_if(is.numeric) %>%
   mutate(rn = row_number()) %>%
   pivot_longer(cols = -rn) %>%
   group_by(rn) %>%
   summarise(Median = median(value), Mean = mean(value), Min = min(value)) %>%
   select(-rn) %>% 
   bind_cols(df1, .)
#      ID loc1 loc2 loc3 loc4 Median   Mean Min
#1  daisy   10  100    0 1000     55  277.5   0
#2   lily   20  200    0 2000    110  555.0   0
#3   rose   30  300    0 3000    165  832.5   0
#4  tulip   40  400    0 4000    220 1110.0   0
#5  poppy   50  500    0 5000    275 1387.5   0
#6   iris   60  600    0 6000    330 1665.0   0
#7 orchid   70  700    0 7000    385 1942.5   0
#8  lotus   80  800    0 8000    440 2220.0   0
#9 crocus   90  900    0 9000    495 2497.5   0

To get the sum of numeric columns and with a condition to sum only if the column sum is greater than 0

df1 %>% 
     summarise_if(~is.numeric(.) && sum(.) > 0, sum)
#  loc1 loc2  loc4
#1  450 4500 45000

Or using base R

Filter(sum, colSums(df1[-1]))
#  loc1  loc2  loc4 
#   450  4500 45000 

If the intention is to select the columns with sum > 0 and numeric, then use select_if

df1 %>% 
   select_if(~ is.numeric(.) && sum(.) > 0)
#  loc1 loc2 loc4
#1   10  100 1000
#2   20  200 2000
#3   30  300 3000
#4   40  400 4000
#5   50  500 5000
#6   60  600 6000
#7   70  700 7000
#8   80  800 8000
#9   90  900 9000

Or to include the first column factor as well

df1 %>% 
    select_if(~ is.factor(.)|(is.numeric(.) && sum(.) > 0))
#      ID loc1 loc2 loc4
#1  daisy   10  100 1000
#2   lily   20  200 2000
#3   rose   30  300 3000
#4  tulip   40  400 4000
#5  poppy   50  500 5000
#6   iris   60  600 6000
#7 orchid   70  700 7000
#8  lotus   80  800 8000
#9 crocus   90  900 9000

Or using the OP's code, we add + 1 to it because the cs was created by removing the first column

df1 %>% 
      select(which(cs > 0)+1)

Including the first column

df1 %>% 
     select(1, which(cs > 0)+1)

Or remove the first column from 'df1' and then use the code from the OP's post

df1 %>%
  select(-1) %>%
  select( which(cs > 0))

Upvotes: 3

Related Questions