Reputation: 57
I performed CV calculation over numeric columns of a data frame grouped by a specific column values. Then based on resulting dataframe, I replaced the related column values in first data frame. I did the operation by mixing dplyr functions, merge function, and for loop with if statement. I would like to ask you for a shorter operations of tidyverse functions if any. Here are the steps and example outcome.
I had a data frame- A
Accession Column1 Column2 Column3 Root ID
2000_1 0 0.2 14 2000 1
2000_2 0.01 0.2 17 2000 2
2001_1 0.012 0.22 11 2001 1
2001_2 0.011 0.231 17 2001 2
For this demo I have three numerical columns Column1, Column2, and Column3, but in actual data it is > 500. I subset the data frame to keep only numeric columns (Column1, Column2, Column3, and Root) except ID column. Then I grouped the rows based on column "Root" and calculated the coefficient of variance as a second data frame by using aggregate function/also using tibble functions groupby/summarize_if/ungroup. I replaced the values with NA if they are greater than 30. Here is the resulting dataframe- B
Column1 Column2 Column3 Root
NA 0 13.68 2000
6.14 3.44 NA 2001
I merged data frame A and B by using merge function based on Root column. Resulting data frame was
Column1.x Column2.x Column3.x Accession Column1.y Column2.y Column3.y Root ID
NA 0 13.68 2000_1 0 0.2 14 2000 1
NA 0 13.68 2000_2 0.01 0.2 17 2000 2
6.14 3.44 NA 2001_1 0.012 0.22 11 2001 1
6.14 3.44 NA 2001_2 0.011 0.231 17 2001 2
Then I looped through numeric columns (Column 1, 2, 3). For Root column value 2000, I replaced Column1.y values (0 and 0.01) with Column1.x values are NA. For Root column value 2001, I didn't replace Column1.y values are not NA.
Resulting data frame was
Accession Column1.y Column2.y Column3.y Root ID
2000_1 NA 0.2 14 2000 1
2000_2 NA 0.2 17 2000 2
2001_1 0.012 0.22 NA 2001 1
2001_2 0.011 0.231 NA 2001 2
Upvotes: 2
Views: 477
Reputation: 28826
Data:
dfa <- read.table(text="Accession Column1 Column2 Column3 Root ID
2000_1 0 0.2 14 2000 1
2000_2 0.01 0.2 17 2000 2
2001_1 0.012 0.22 11 2001 1
2001_2 0.011 0.231 17 2001 2", header = T)
Libraries and Functions:
library(tidyverse)
cv <- function(x) 100 * (sd(x) / mean(x))
If we cut to the chase and consider the end result, basically you want to replace the values in Column1:Column3
with NA
if CV is greater than 30. Otherwise, you want to preserve the original values. The code below does that.
dfa %>%
group_by(Root) %>%
mutate_at(vars(Column1:Column3),
list(~ if(cv(.) > 30) NA else .))
Result:
#> # A tibble: 4 x 6
#> Accession Column1 Column2 Column3 Root ID
#> <fct> <dbl> <dbl> <dbl> <int> <int>
#> 1 2000_1 NA 0.2 14 2000 1
#> 2 2000_2 NA 0.2 17 2000 2
#> 3 2001_1 0.012 0.22 NA 2001 1
#> 4 2001_2 0.011 0.231 NA 2001 2
If we want to follow your train of thoughts, then we'll end up with a more complicated code which is illustrated below;
dfa %>%
select_if(function(col) is.numeric(col) & all(col != .$ID)) %>%
group_by(Root) %>%
summarise_each(list(cv)) %>%
mutate_at(vars(Column1:Column3),
list(~ ifelse(. > 30, NA, 0))) %>%
left_join(dfa[,c("Root", "ID")], . , by = "Root") %>%
bind_rows(dfa, .) %>%
group_by(Root, ID) %>%
summarise_each(list(~ if(is.numeric(.)) sum(., na.rm = FALSE) else first(.))) %>%
ungroup %>%
select(-ID, -Root, everything())
Explanation:
ID
.Root
.NA
and the rest with 0. I am planing to sum these with the original values as it seems that OP is interested in preserving the NAs (i.e. greater than 30) from this CV matrix but keep the other values unchanged in the original dataset. So summing with 0 keep the latter unchanged while those NAs (na.rm = F
) will affect the values.Root
and ID
.Column1
, Column2
, etc.) by summing the values from original dataframe and modified CV matrix and keeping the first value from other columns (since the original dataframe was first in bind_rows
that means preserving the original values).Another solution would be very similar to what's above, but instead of joining to get the ID column and expand the CV matrix, one can preserve them from the beginning by summarizing as list column and later unnesting them.
dfa %>%
mutate(ID = as.factor(ID)) %>%
group_by(Root) %>%
summarise_each(list(~ if(is.numeric(.)) cv(.) else list(.))) %>%
mutate_at(vars(Column1:Column3),
list(~ ifelse(. > 30, NA, 0))) %>%
unnest(cols = c(Accession, ID)) %>%
mutate(ID = as.integer(ID)) %>%
bind_rows(dfa, .) %>%
group_by(Root, ID) %>%
summarise_each(list(~ if(is.numeric(.)) sum(., na.rm = FALSE) else first(.))) %>%
ungroup %>%
select(-ID, -Root, everything())
Upvotes: 3