Reputation: 218
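I have a dataset in which each row is a gene and every gene belongs to a group. I want to select one gene per group into a new dataframe based on a few conditions, applied in order:
1. If the top-scoring gene is more than 0.05 ahead of the other genes in its group, select it.
2. Otherwise, among the genes scoring within 0.05 of the top score, select the one with the highest direct_count.
3. If direct_count matches, select the one with the highest secondary_count.
4. If both counts match, keep all of the tied genes.
I've been trying to adapt similar questions on here, but I'm having trouble making those examples work with this many conditions.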
The data I have looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
4 Gene1 0.1 10 20
4 Gene2 0.68 3 1
4 Gene3 0.7 0 1
Output selection of gene per group:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5 #highest direct_count
2 CHST6 0.4295135 1 3 #highest secondary_count after matching direct_count
3 ACE 0.634 1 1 #ACE and NOS2 have matching counts
3 NOS2 0.6345 1 1
I am currently trying to use dplyr::group_by() with if statements.
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1", "Gene2", "Gene3"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.1, 0.68, 0.7), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L, 10L, 3L, 0L ), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L, 20L, 1L, 1L)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Edit:
I'm including sessionInfo() below. Also note that in my real data some rows have NA for their direct_count and secondary_count.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0
[5] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 tidyr_1.1.2
[9] dplyr_1.0.2 data.table_1.13.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.6 compiler_4.0.2
[5] dbplyr_1.4.4 tools_4.0.2 jsonlite_1.7.1 lubridate_1.7.9
[9] lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.8
[13] reprex_0.3.0 cli_2.1.0 DBI_1.1.0 rstudioapi_0.11
[17] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2
[21] fs_1.5.0 generics_0.0.2 vctrs_0.3.4 gtools_3.8.2
[25] hms_0.5.3 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
[29] R6_2.4.1 fansi_0.4.1 readxl_1.3.1 modelr_0.1.8
[33] blob_1.2.1 magrittr_1.5 backports_1.1.10 scales_1.1.1
[37] ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_1.4-1
[41] stringi_1.5.3 munsell_0.5.0 broom_0.7.2 crayon_1.3.4
Edit: a problem with the selection on my real data:
structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1",
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502,
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62,
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
From this group, Gene1 is being selected when it should actually be CHST6, and I can't find why.
Data looks like:
Group Gene Score direct_count secondary_count
1 2 CFDP1 0.5517401 1 62
2 2 CHST6 0.5989186 1 6
3 2 RNU6-758P 0.5644914 0 1
4 2 Gene1 0.5672916 0 1
5 2 TMEM170A 0.6167083 0 2
CHST6 has the highest direct_count of all genes scoring within 0.05 of the top-scored gene in this group, yet Gene1 is being selected.
Upvotes: 2
Views: 175
Reputation: 11255
Here's a data.table approach, although a larger dataset would help verify that it works.
library(data.table)
# dt is the question's input data as a data.table
dt[,
   {
     if (.N == 1L) {
       ind = 1L
     } else {
       # sort scores in decreasing order, keeping the original row positions
       o = sort(Score, decreasing = TRUE, index.return = TRUE)
       x = o$x
       ix = o$ix
       if (x[1L] - x[2L] > 0.05) {
         # clear winner on Score
         ind = ix[1L]
       } else {
         # candidates: rows scoring within 0.05 of the top score
         search_inds = ix[which(x[1L] - x <= 0.05)]
         direct_count_sub = direct_count[search_inds]
         dc_max = direct_count_sub == max(direct_count_sub)
         if (sum(dc_max) == 1L) {
           ind = search_inds[dc_max]
         } else {
           # break ties on secondary_count among the direct_count ties only
           tied_inds = search_inds[dc_max]
           secondary_count_sub = secondary_count[tied_inds]
           ind = tied_inds[secondary_count_sub == max(secondary_count_sub)]
         }
       }
     }
     # sort() restores the original row order within the group
     .SD[sort(ind)]
   },
   by = Group]
## Group Gene Score direct_count secondary_count
## <int> <char> <num> <int> <int>
##1: 1 AQP11 0.5566507 4 5
##2: 2 CHST6 0.4295135 1 3
##3: 3 ACE 0.6340000 1 1
##4: 3 NOS2 0.6345000 1 1
Note that data.table syntax is heavily influenced by base R. The code walks through your 4 criteria group by group and then subsets each group according to whichever case applied.
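Since the question asks about dplyr, the same selection can be sketched as a chain of grouped filters. This is an illustration, not part of the original answer: `select_genes` and `genes` are hypothetical names, the 0.05 window mirrors the threshold in the data.table code above, and treating NA counts as 0 is an assumption made only so that `max()` does not return NA.

```r
library(dplyr)

# Hypothetical helper: keeps one gene per group, or several when they
# tie on both counts.
select_genes <- function(df, threshold = 0.05) {
  df %>%
    # Assumption: a missing count is treated as 0
    mutate(dc = coalesce(as.numeric(direct_count), 0),
           sc = coalesce(as.numeric(secondary_count), 0)) %>%
    group_by(Group) %>%
    # candidates: genes scoring within `threshold` of the group's top score
    filter(max(Score) - Score <= threshold) %>%
    # highest direct_count among the candidates
    filter(dc == max(dc)) %>%
    # ties broken on secondary_count; any remaining ties are all kept
    filter(sc == max(sc)) %>%
    ungroup() %>%
    select(-dc, -sc)
}

# The question's input data, rebuilt as a plain data.frame
genes <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.1, 0.68, 0.7),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 10L, 3L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 20L, 1L, 1L)
)

selected <- select_genes(genes)
selected$Gene
```

A gene that leads its group by more than the threshold is the only candidate left after the first filter, so the "clear winner" case needs no separate branch. On the data above this keeps AQP11, CHST6, both ACE and NOS2, and Gene2 for group 4.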
Upvotes: 0
Reputation: 502
Your data:
> db
# A tibble: 7 x 5
Group `p Gene` Score direct_count secondary_count
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 AQP11 0.557 4 5
2 1 CLNS1A 0.281 0 2
3 1 RSF1 0.547 3 6
4 2 CFDP1 0.419 1 2
5 2 CHST6 0.430 1 3
6 3 ACE 0.634 1 1
7 3 NOS2 0.634 1 1
First, we write a function that does what you want for a df containing only one group:
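library(dplyr)

gene_selection = function(df) {
  # a single-gene group is returned unchanged
  if (nrow(df) == 1) {
    return(df)
  }
  # rank genes from highest to lowest Score
  df = arrange(df, desc(Score))
  # clear winner: the top Score leads the runner-up by more than 0.2
  if ((df$Score[1] - df$Score[2]) > 0.2) {
    return(df[1, ])
  }
  # otherwise compare the top two genes on direct_count
  if (df$direct_count[1] != df$direct_count[2]) {
    return(df[which.max(df$direct_count[1:2]), ])
  }
  # then on secondary_count
  if (df$secondary_count[1] != df$secondary_count[2]) {
    return(df[which.max(df$secondary_count[1:2]), ])
  }
  # still tied: keep both
  df[1:2, ]
}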
Now use group_map to apply this function to every group:
> db%>%
+ mutate(Group2=Group) %>%
+ group_by(Group2) %>%
+ group_map(~gene_selection(.)) %>%
+ bind_rows()
# A tibble: 4 x 5
Group `p Gene` Score direct_count secondary_count
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 AQP11 0.557 4 5
2 2 CHST6 0.430 1 3
3 3 NOS2 0.634 1 1
4 3 ACE 0.634 1 1
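A possible simplification, offered as a sketch rather than as part of the original answer: dplyr's `group_modify()` applies a function to each group and reattaches the grouping column itself, so the `Group2` copy and the `bind_rows()` step are not needed. Here `toy` and `top_row` are hypothetical stand-ins for the real data and for `gene_selection`.

```r
library(dplyr)

# Hypothetical toy data and per-group function, for illustration only
toy <- data.frame(Group = c(1, 1, 2), Score = c(0.2, 0.9, 0.5))
top_row <- function(df) df[which.max(df$Score), , drop = FALSE]

result <- toy %>%
  group_by(Group) %>%
  # .x holds the rows of one group, without the grouping column
  group_modify(~ top_row(.x)) %>%
  ungroup()

result
```

The function passed to `group_modify()` must return a data frame, which `gene_selection` already does.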
Upvotes: 1
Reputation: 77
group_by and filter from the tidyverse are your friends here.
library(dplyr)
library(tidyr)
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table", "data.frame"))
new_df <- df %>%
  # first condition
  group_by(Group) %>%
  mutate(max_score_difference = abs(max(Score) - min(Score))) %>%
  filter((max_score_difference > 0.02 & Score == max(Score)) | max_score_difference <= 0.02) %>%
  # second condition
  filter(max_score_difference > 0.02 | (max_score_difference <= 0.02 & direct_count == max(direct_count))) %>%
  # third condition
  filter(max_score_difference > 0.02 | (max_score_difference <= 0.02 & secondary_count == max(secondary_count))) %>%
  ungroup() %>%
  # fourth condition met by max statements in filters above
  select(-max_score_difference) %>%
  data.frame()
print(new_df)
Upvotes: 1