Reputation: 1629
The following data set is given:
library(tidyverse)
# example data
df1 = data.frame(ID = c("daisy", "lily", "rose", "tulip", "poppy", "iris", "orchid", "lotus", "crocus"),
loc1 = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
loc2 = c(100, 200, 300, 400, 500, 600, 700, 800, 900),
loc3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
loc4 = c(1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000))
Problem 1: For each row, extract minimum value, calculate mean, and append both results to data set. Works with the below code:
df1 %>%
rowwise() %>%
mutate(Min = min(c(loc1, loc2, loc3, loc4)), Mean = mean(c(loc1, loc2, loc3, loc4)))
How can I make the code more generic so it can be applied to all columns in a data set that, e.g. do not contain factors or strings? I want to avoid having to type in column names when I have 100+ columns. I tried the following:
df1 %>%
rowwise() %>%
mutate(Min = min(is_double(df1)), Mean = mean(is_double(df1)))
but it does not yield the desired result:
ID loc1 loc2 loc3 loc4 Median Mean
<fct> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
1 daisy 10 100 0 1000 FALSE 0
2 lily 20 200 0 2000 FALSE 0
3 rose 30 300 0 3000 FALSE 0
How can this be solved?
Problem 2: How can I calculate the sum of the values in each column, then remove columns where sum = 0, while saving the names of the columns that were removed in a vector or data frame (so I can save the names to a file)?
I tried
cs <- colSums(df1[,2:ncol(df1)])
df1 %>%
select(which(cs > 0))
but get a very strange result where column names are shifted and the column containing only '0' is retained.
loc1 loc2 loc4
1 daisy 10 0
2 lily 20 0
3 rose 30 0
4 tulip 40 0
Any suggestions?
Many thanks!
Upvotes: 2
Views: 367
Reputation: 389135
Answer to Problem 1 :
We can use pmap_dbl
to apply the function per row and use select_if
to select columns which are not factor
or character
.
library(dplyr)
library(purrr)
df1 %>%
mutate(Min = pmap_dbl(select_if(., ~!(is.factor(.) | is.character(.))), min),
Mean = pmap_dbl(select_if(., ~!(is.factor(.) | is.character(.))),
~mean(c(...))))
# ID loc1 loc2 loc3 loc4 Min Mean
#1 daisy 10 100 0 1000 0 277.5
#2 lily 20 200 0 2000 0 555.0
#3 rose 30 300 0 3000 0 832.5
#4 tulip 40 400 0 4000 0 1110.0
#5 poppy 50 500 0 5000 0 1387.5
#6 iris 60 600 0 6000 0 1665.0
#7 orchid 70 700 0 7000 0 1942.5
#8 lotus 80 800 0 8000 0 2220.0
#9 crocus 90 900 0 9000 0 2497.5
Answer to Problem 2 :
We can use summarise_if
to sum
all numeric columns, select the columns which has sum as 0 and save their name in removed_cols
.
removed_cols <- df1 %>%
summarise_if(is.numeric, sum) %>%
select_if(~. == 0) %>%
names
removed_cols
#[1] "loc3"
df1 %>%
summarise_if(is.numeric, sum) %>%
select_if(~. != 0)
# loc1 loc2 loc4
#1 450 4500 45000
Upvotes: 1
Reputation: 887501
With select_if
we can select the numeric columns
library(dplyr)
library(matrixStats)
df1 %>%
mutate(Median = select_if(., is.numeric) %>%
as.matrix %>%
rowMedians,
Mean =select_if(., is.numeric) %>%
rowMeans )
Or convert to 'long' format and then do the group by row
library(dplyr)
library(tidyr)
df1 %>%
select_if(is.numeric) %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
group_by(rn) %>%
summarise(Median = median(value), Mean = mean(value), Min = min(value)) %>%
select(-rn) %>%
bind_cols(df1, .)
# ID loc1 loc2 loc3 loc4 Median Mean Min
#1 daisy 10 100 0 1000 55 277.5 0
#2 lily 20 200 0 2000 110 555.0 0
#3 rose 30 300 0 3000 165 832.5 0
#4 tulip 40 400 0 4000 220 1110.0 0
#5 poppy 50 500 0 5000 275 1387.5 0
#6 iris 60 600 0 6000 330 1665.0 0
#7 orchid 70 700 0 7000 385 1942.5 0
#8 lotus 80 800 0 8000 440 2220.0 0
#9 crocus 90 900 0 9000 495 2497.5 0
To get the sum of numeric
columns and with a condition to sum
only if the column sum is greater than 0
df1 %>%
summarise_if(~is.numeric(.) && sum(.) > 0, sum)
# loc1 loc2 loc4
#1 450 4500 45000
Or using base R
Filter(sum, colSums(df1[-1]))
# loc1 loc2 loc4
# 450 4500 45000
If the intention is to select
the columns with sum
> 0 and numeric
, then use select_if
df1 %>%
select_if(~ is.numeric(.) && sum(.) > 0)
# loc1 loc2 loc4
#1 10 100 1000
#2 20 200 2000
#3 30 300 3000
#4 40 400 4000
#5 50 500 5000
#6 60 600 6000
#7 70 700 7000
#8 80 800 8000
#9 90 900 9000
Or to include the first column factor
as well
df1 %>%
select_if(~ is.factor(.)|(is.numeric(.) && sum(.) > 0))
# ID loc1 loc2 loc4
#1 daisy 10 100 1000
#2 lily 20 200 2000
#3 rose 30 300 3000
#4 tulip 40 400 4000
#5 poppy 50 500 5000
#6 iris 60 600 6000
#7 orchid 70 700 7000
#8 lotus 80 800 8000
#9 crocus 90 900 9000
Or using the OP's code, we add + 1
to it because the cs
was created by removing the first column
df1 %>%
select(which(cs > 0)+1)
Including the first column
df1 %>%
select(1, which(cs > 0)+1)
Or remove the first column from 'df1' and then use the code from the OP's post
df1 %>%
select(-1) %>%
select( which(cs > 0))
Upvotes: 3