I have the following dataframe:
df = data.frame(A=c("CLASS_3", "CLASS_3", "CLASS_1", "CLASS_0", "CLASS_2"), B=c("CLASS_0", "CLASS_1", "CLASS_1", "CLASS_0", "CLASS_3"), C=c("CLASS_0", "CLASS_0", "CLASS_2", "CLASS_0", "CLASS_2"), D=c("CLASS_3", "CLASS_4", "CLASS_2", "CLASS_0", "CLASS_2"),E=c("CLASS_4", "CLASS_4", "CLASS_1", "CLASS_1", "CLASS_2"), F=c("CLASS_3", "CLASS_2", "CLASS_1", "CLASS_0", "CLASS_2"))
row.names(df) <- c("gene1", "gene2", "gene3", "gene4", "gene5")
Every gene is classified into 5 factors, CLASS_0 to CLASS_4, for 6 different conditions (A to F).
I want to check whether the CLASS changes from condition to condition, and I am interested in switches from CLASS_0 to CLASS_3 or CLASS_4, so two conditions/columns are always compared at a time. If there is a switch, I want to write the result into two new columns, SWITCH0->3 and SWITCH0->4.
This is my expected output:
Here, for gene1, there is a SWITCH0->3 from B to A, B to D, B to F, C to A, C to D, C to F, and a SWITCH0->4 from B to E and C to E.
Using dplyr, I get all rows that contain CLASS_0 and CLASS_4, but how do I construct the new column?
df %>% filter_all(any_vars(. %in% c('CLASS_0'))) %>% filter_all(any_vars(. %in% c('CLASS_4')))
UPDATE: I updated the data with three more cases:
- no CLASS_0, CLASS_3 or CLASS_4 in a row (as in gene3)
- no CLASS_3 or CLASS_4 in a row (as in gene4)
- no CLASS_0 in a row (as in gene5)
Here is a way to do what you wanted using dplyr, tidyr, and purrr:
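df = data.frame(A=c("CLASS_3", "CLASS_3", "CLASS_1", "CLASS_0", "CLASS_2"), B=c("CLASS_0", "CLASS_1", "CLASS_1", "CLASS_0", "CLASS_3"), C=c("CLASS_0", "CLASS_0", "CLASS_2", "CLASS_0", "CLASS_2"), D=c("CLASS_3", "CLASS_4", "CLASS_2", "CLASS_0", "CLASS_2"),E=c("CLASS_4", "CLASS_4", "CLASS_1", "CLASS_1", "CLASS_2"), F=c("CLASS_3", "CLASS_2", "CLASS_1", "CLASS_0", "CLASS_2"))
row.names(df) <- c("gene1", "gene2", "gene3", "gene4", "gene5")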
library(dplyr)
library(tidyr)
library(purrr)
# function to generate the "origin-dest" combination strings
generate_switch_string <- function(origin, dest) {
  paste(unlist(map(origin,
                   paste, sep = "-",
                   dest)),
        collapse = ",")
}
# create a gene column based on the row names
df <- df %>% mutate(gene = row.names(.))
combination_df <- df %>%
  # create a long df for later use
  gather(key = class_name, value = class_value, A:F) %>%
  # only keep the classes of interest here
  filter(class_value %in% c("CLASS_0", "CLASS_3", "CLASS_4")) %>%
  # keep genes that have CLASS_0 plus at least one of CLASS_3/CLASS_4
  group_by(gene) %>%
  filter(any(class_value == "CLASS_0") & n_distinct(class_value) > 1) %>%
  # group the condition names of each class together
  group_by(gene, class_value) %>%
  summarize(class_names = list(class_name), .groups = "drop") %>%
  # generate the switch combinations using the function defined above
  group_by(gene) %>%
  summarize("switch_0->3" =
              generate_switch_string(
                unlist(class_names[class_value == "CLASS_0"]),
                unlist(class_names[class_value == "CLASS_3"])),
            "switch_0->4" =
              generate_switch_string(
                unlist(class_names[class_value == "CLASS_0"]),
                unlist(class_names[class_value == "CLASS_4"])))
combination_df
#> # A tibble: 2 x 3
#> gene `switch_0->3` `switch_0->4`
#> <chr> <chr> <chr>
#> 1 gene1 B-A,B-D,B-F,C-A,C-D,C-F B-E,C-E
#> 2 gene2 C-A C-D,C-E
df %>% left_join(combination_df, by = "gene")
#> A B C D E F gene switch_0->3
#> 1 CLASS_3 CLASS_0 CLASS_0 CLASS_3 CLASS_4 CLASS_3 gene1 B-A,B-D,B-F,C-A,C-D,C-F
#> 2 CLASS_3 CLASS_1 CLASS_0 CLASS_4 CLASS_4 CLASS_2 gene2 C-A
#> 3 CLASS_1 CLASS_1 CLASS_2 CLASS_2 CLASS_1 CLASS_1 gene3 <NA>
#> 4 CLASS_0 CLASS_0 CLASS_0 CLASS_0 CLASS_1 CLASS_0 gene4 <NA>
#> 5 CLASS_2 CLASS_3 CLASS_2 CLASS_2 CLASS_2 CLASS_2 gene5 <NA>
#> switch_0->4
#> 1 B-E,C-E
#> 2 C-D,C-E
#> 3 <NA>
#> 4 <NA>
#> 5 <NA>
Created on 2022-01-05 by the reprex package (v2.0.1)
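If you would rather avoid the extra packages, the same pairing idea can be sketched in base R. This is only a rough sketch under a few assumptions: `switch_pairs` is a hypothetical helper name I made up for illustration, it expects one gene's classes as a named character vector, and it returns NA when either class is missing from the row.

```r
# Hypothetical helper (not part of dplyr/tidyr/purrr): list every
# "origin-dest" pair of conditions where origin has class `from`
# and dest has class `to`.
switch_pairs <- function(classes, from, to) {
  origin <- names(classes)[classes == from]
  dest   <- names(classes)[classes == to]
  # no switch possible if either class is absent in this row
  if (length(origin) == 0 || length(dest) == 0) return(NA_character_)
  # outer() builds all origin x dest combinations (rows = origins);
  # t() makes the flattened order run origin by origin
  pairs <- outer(origin, dest, paste, sep = "-")
  paste(t(pairs), collapse = ",")
}

# gene1 from the question, as a named character vector
gene1 <- c(A = "CLASS_3", B = "CLASS_0", C = "CLASS_0",
           D = "CLASS_3", E = "CLASS_4", F = "CLASS_3")

switch_pairs(gene1, "CLASS_0", "CLASS_3")
#> [1] "B-A,B-D,B-F,C-A,C-D,C-F"
switch_pairs(gene1, "CLASS_0", "CLASS_4")
#> [1] "B-E,C-E"
```

To fill a whole column, something like apply(df[, c("A", "B", "C", "D", "E", "F")], 1, switch_pairs, from = "CLASS_0", to = "CLASS_3") should work, since apply() passes each row as a vector named by the column names.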