DN1

Reputation: 218

Selecting a row out of a group with conditions?

I have a dataset in which each row is a gene and the genes are assigned to groups. I want to select one gene per group into a new data frame based on a few conditions:

  1. Select the gene with the highest score if the score difference between others in the group is >0.02
  2. If the score difference between genes in a group is <0.02 then select a gene with a higher direct_count
  3. If the direct_count is the same select the gene with the highest secondary_count
  4. If everything is the same select both genes.

I've been trying to use similar questions on here, but I'm having trouble making other examples work for my code with setting up this many conditions.

The data I have looks like:

  Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5
    1   CLNS1A   0.2811747       0               2
    1   RSF1     0.5469924       3               6
    2   CFDP1    0.4186066       1               2
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1
    4   Gene1    0.1            10               20
    4   Gene2    0.68            3                1
    4   Gene3    0.7             0                1

Output selection of gene per group:

 Group Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5       #highest direct_count
    2   CHST6    0.4295135       1               3       #highest secondary_count after matching direct_count
    3   ACE      0.634           1               1       #ACE and NOS2 have matching counts
    3   NOS2     0.6345          1               1

I am trying to use dplyr::group_by() with if statements currently.

Input data:

structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1", "Gene2", "Gene3"), Score = c(0.5566507, 
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.1, 0.68, 0.7), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L, 10L, 3L, 0L ), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L, 20L, 1L, 1L)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

Edit: I'm including my sessionInfo() output. I also want to note that in my real data some rows have NA for their direct_count and secondary_count.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.0     stringr_1.4.0     purrr_0.3.4       readr_1.4.0      
 [5] tibble_3.0.4      ggplot2_3.3.2     tidyverse_1.3.0   tidyr_1.1.2      
 [9] dplyr_1.0.2       data.table_1.13.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       cellranger_1.1.0 pillar_1.4.6     compiler_4.0.2  
 [5] dbplyr_1.4.4     tools_4.0.2      jsonlite_1.7.1   lubridate_1.7.9 
 [9] lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.8     
[13] reprex_0.3.0     cli_2.1.0        DBI_1.1.0        rstudioapi_0.11 
[17] haven_2.3.1      withr_2.3.0      xml2_1.3.2       httr_1.4.2      
[21] fs_1.5.0         generics_0.0.2   vctrs_0.3.4      gtools_3.8.2    
[25] hms_0.5.3        grid_4.0.2       tidyselect_1.1.0 glue_1.4.1      
[29] R6_2.4.1         fansi_0.4.1      readxl_1.3.1     modelr_0.1.8    
[33] blob_1.2.1       magrittr_1.5     backports_1.1.10 scales_1.1.1    
[37] ellipsis_0.3.1   rvest_0.3.6      assertthat_0.2.1 colorspace_1.4-1
[41] stringi_1.5.3    munsell_0.5.0    broom_0.7.2      crayon_1.3.4  

Edit: a problem with the selection on my real data:

structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1", 
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502, 
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62, 
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

From this group Gene1 is being selected when it should actually be CHST6, and I can't find out why.

Data looks like:

    
   Group Gene         Score      direct_count      secondary_count
1   2    CFDP1        0.5517401        1                  62
2   2    CHST6        0.5989186        1                   6
3   2    RNU6-758P    0.5644914        0                   1
4   2    Gene1        0.5672916        0                   1
5   2    TMEM170A     0.6167083        0                   2

CHST6 has the highest direct_count of all genes whose score is within 0.05 of the top-scoring gene in this group, yet Gene1 is being selected.
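
That expectation can be checked by hand. A minimal base R sketch, assuming the 0.05 window is measured from the top score in the group (`grp`, `gap`, `near_top`, and `winner` are illustrative names, not part of any code in this thread):

```r
# The five rows of the problem group, copied from the dput() output above
grp <- data.frame(
  Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"),
  Score = c(0.551740109920502, 0.598918557167053, 0.564491391181946,
            0.567291617393494, 0.616708278656006),
  direct_count = c(1, 1, 0, 0, 0),
  secondary_count = c(62, 6, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Distance of every gene from the top score in the group
grp$gap <- max(grp$Score) - grp$Score

# Genes within the assumed 0.05 window: CHST6, Gene1, TMEM170A
near_top <- grp[grp$gap <= 0.05, ]

# Among those, the gene(s) with the highest direct_count
winner <- near_top$Gene[near_top$direct_count == max(near_top$direct_count)]
print(winner)  # "CHST6"
```

CFDP1 and RNU6-758P fall outside the window (gaps of roughly 0.065 and 0.052), so CFDP1's direct_count of 1 never competes with CHST6.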

Upvotes: 2

Views: 175

Answers (3)

Cole

Reputation: 11255

Here's an approach, although it would be nicer to have a larger dataset to verify that it works.


library(data.table)

# dt is the input data from the question (the dput() output) as a data.table
dt[,
   {
     if (.N == 1L) 
       ind = 1L
     else {
      # sort scores in decreasing order, keeping the original row positions
      o = sort(Score, decreasing = TRUE, index.return = TRUE)
      x = o$x
      ix = o$ix
      if (x[1L] - x[2L] > 0.05) 
        ind = ix[1L]
      else {
        # rows whose score is within 0.05 of the top score
        search_inds = ix[which(x[1L] - x <= 0.05)]
        direct_count_sub = direct_count[search_inds]
        # keep only the rows tied for the highest direct_count
        search_inds = search_inds[direct_count_sub == max(direct_count_sub)]
        # among those rows only, keep every row tied for the highest
        # secondary_count (a single row if there is no tie)
        secondary_count_sub = secondary_count[search_inds]
        ind = search_inds[secondary_count_sub == max(secondary_count_sub)]
      }
     }
     .SD[ind]
   },
   by = Group]

##   Group   Gene     Score direct_count secondary_count
##   <int> <char>     <num>        <int>           <int>
##1:     1  AQP11 0.5566507            4               5
##2:     2  CHST6 0.4295135            1               3
##3:     3    ACE 0.6340000            1               1
##4:     3   NOS2 0.6345000            1               1

Note that this is heavily influenced by base R. We are largely going through your 4 criteria by group and then subsetting the group based on which case was true.
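
The same cascade can also be written as three successive filters. This is a minimal base R sketch rather than a tested solution: it assumes the 0.05 window used above and no NA values in the count columns, and `select_gene`, `dat`, and `picked` are illustrative names that do not appear in the question.

```r
# Apply the criteria to one group: each step keeps only the rows still tied.
select_gene <- function(g, window = 0.05) {
  # 1. rows whose Score is within `window` of the group's top score
  g <- g[max(g$Score) - g$Score <= window, , drop = FALSE]
  # 2. of those, rows with the highest direct_count
  g <- g[g$direct_count == max(g$direct_count), , drop = FALSE]
  # 3. of those, rows with the highest secondary_count;
  # 4. anything still tied is returned together
  g[g$secondary_count == max(g$secondary_count), , drop = FALSE]
}

# The question's input data, rebuilt as a plain data.frame
dat <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2",
           "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.1, 0.68, 0.7),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 10L, 3L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 20L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Split by group, select within each group, and stack the results
picked <- do.call(rbind, lapply(split(dat, dat$Group), select_gene))
print(picked$Gene)  # "AQP11" "CHST6" "ACE" "NOS2" "Gene2"
```

If the runner-up is more than `window` below the top score, step 1 leaves a single row, so the "clear winner" case needs no separate branch.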

Upvotes: 0

Dayne

Reputation: 502

Your data:

> db
# A tibble: 7 x 5
  Group `p Gene` Score direct_count secondary_count
  <dbl> <chr>    <dbl>        <dbl>           <dbl>
1     1 AQP11    0.557            4               5
2     1 CLNS1A   0.281            0               2
3     1 RSF1     0.547            3               6
4     2 CFDP1    0.419            1               2
5     2 CHST6    0.430            1               3
6     3 ACE      0.634            1               1
7     3 NOS2     0.634            1               1

We first write a function that does what you want for a df containing only one group:

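library(dplyr)

gene_selection = function (df) {
  # a group with a single gene: nothing to choose
  if (nrow(df)==1) {
    return(df)
  }
  else {
    # sort by Score, highest first
    df=arrange(df,desc(Score))
    # clear winner: the top score leads the runner-up by more than 0.02
    if((df$Score[1]-df$Score[2])>0.02) {
      return(df[1,])
    }
    else{
      # otherwise compare the top two genes on direct_count
      if (df$direct_count[1]!=df$direct_count[2]) {
        return(df[which.max(df$direct_count[1:2]),])
      }
      else {
        # direct_count tied: compare the top two on secondary_count
        if (df$secondary_count[1]!=df$secondary_count[2]) {
          return(df[which.max(df$secondary_count[1:2]),])
        }
        else {
          # everything tied: keep both genes
          return(df[1:2,])
        }
      }
    }
  }  
}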

Now using group_map to implement this function on all groups:

> db%>%
+     mutate(Group2=Group) %>%
+     group_by(Group2) %>%
+     group_map(~gene_selection(.)) %>%
+     bind_rows()
# A tibble: 4 x 5
  Group `p Gene` Score direct_count secondary_count
  <dbl> <chr>    <dbl>        <dbl>           <dbl>
1     1 AQP11    0.557            4               5
2     2 CHST6    0.430            1               3
3     3 NOS2     0.634            1               1
4     3 ACE      0.634            1               1
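
The `Group2` copy is needed above because `group_map()` drops the grouping column from the data it passes to the function. As a hedged alternative, `group_modify()` re-attaches the grouping column itself. The sketch below uses a toy tibble and a stand-in function (`toy` and `top_row` are illustrative names, not the data or function from this answer):

```r
library(dplyr)

# Toy data: two groups, three genes
toy <- tibble(
  Group = c(1, 1, 2),
  Gene  = c("A", "B", "C"),
  Score = c(0.2, 0.9, 0.5)
)

# Stand-in per-group function: keep the highest-scoring row
top_row <- function(df) df[which.max(df$Score), ]

# group_modify() applies the function per group and adds Group back
res <- toy %>%
  group_by(Group) %>%
  group_modify(~ top_row(.x)) %>%
  ungroup()

print(res)
```

Any function that takes a one-group data frame and returns a data frame, such as `gene_selection`, could be used in place of `top_row`.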

Upvotes: 1

selfawarelemon

Reputation: 77

group_by and filter from the tidyverse are your friends here.

library(dplyr)
library(tidyr)


df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), 
Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table", "data.frame"))
                 
                 
new_df <- df %>%
  #first condition
  group_by(Group) %>%
  mutate(max_score_difference = abs(max(Score)-min(Score))) %>%
  # <= rather than < so a group whose difference is exactly 0.02 is not dropped
  filter((max_score_difference > 0.02 & Score == max(Score)) | max_score_difference <= 0.02) %>%
  # second condition
  filter(max_score_difference > 0.02 | (max_score_difference <= 0.02 & direct_count == max(direct_count))) %>%
  # third condition
  filter(max_score_difference > 0.02 | (max_score_difference <= 0.02 & secondary_count == max(secondary_count))) %>%
  ungroup() %>%
  #fourth condition met by max statements in filters above
  select(-max_score_difference) %>%
  data.frame()

print(new_df)

Upvotes: 1
