PesKchan

Reputation: 968

Segregate rows based on positive and negative values across the different columns

The first column of my data frame holds the variable (gene) names, one per row, and the remaining columns hold their values, which can be positive or negative. I would like to filter the rows based on whether the values are positive or negative.

My small subset dataframe

structure(list(gene = c("SCML4", "RASGRP1", "RP1-47M23.3", "TIGIT", 
"IL2RB", "IKZF3"), PC1 = c(0.0976999752508752, 0.0963683648774497, 
0.0958379291214584, 0.095581364305455, 0.0953187100695565, 0.0952640683198088
), PC2 = c(0.0415177491122262, 0.0149616407858333, 0.0592932173696311, 
0.0490135176285661, 0.0666662088855938, 0.0652039968982664), 
    PC3 = c(-0.0480347151614553, -0.05574053153725, -0.04805364872616, 
    -0.0486181477818392, -0.0437832673958965, -0.0450981246281503
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

Data

gene           PC1    PC2     PC3
  <chr>        <dbl>  <dbl>   <dbl>
1 SCML4       0.0977 0.0415 -0.0480
2 RASGRP1     0.0964 0.0150 -0.0557
3 RP1-47M23.3 0.0958 0.0593 -0.0481
4 TIGIT       0.0956 0.0490 -0.0486
5 IL2RB       0.0953 0.0667 -0.0438
6 IKZF3       0.0953 0.0652 -0.0451

I found a way to filter these:

top_genes <- df %>% 
  # select only the PCs we are interested in
  select(gene, PC3) %>%
  # convert to a "long" format
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>% 
  # for each PC
  group_by(PC) %>% 
  # arrange by descending order of loading
  arrange(desc(abs(loading))) %>% 
  # take the 10 top rows
  slice(1:10) %>% 
  # pull the gene column as a vector
  pull(gene) %>% 
  # ensure only unique genes are retained
  unique()

top_genes

To get the top genes for PC1, PC2 and PC3 individually, which is how I want to segregate them, I currently have to run the code once with each of

  select(gene, PC1) or select(gene, PC2) or select(gene, PC3).

How can I get the top genes for each PC column separately in one go, instead of selecting each PC one by one?

Here is the output for one of my data sets, which has more rows.

For PC1 I get this

[1] "ZNF550"       "TBC1D32"      "CCDC171"      "ZNF493"       "ZNF749"       "AC004076.5"   "TEX10"        "ZNF573"      
 [9] "ZNF610"       "ZNF891"       "FAM179B"      "ZNF551"       "ZNF84"        "ZNF549"       "CTC-559E9.8"  "RP11-242D8.1"
[17] "STX18-AS1"    "ZNF571"       "ZSCAN30"      "RP11-156P1.3"

For PC2 I get this

 [1] "TNRC6B"        "PARP11"        "FBXL4"         "PPT2"          "RPL7P11"       "EMID1"         "AKAP5"         "RPL7P52"      
 [9] "AC003104.1"    "ZMYM3"         "SPNS2"         "EGFL8"         "SENP7"         "RPL7P7"        "ZNF808"        "RP11-358B23.5"
[17] "CTC-444N24.7"  "SOS1"          "ARHGAP23"      "FAM227A" 

For PC3 I get this

 [1] "AC090945.1"    "TSC22D1-AS1"   "SV2A"          "TSC22D1"       "MIR133A1HG"    "ZNF792"        "ATP8B2"        "PIBF1"        
 [9] "RP3-415N12.1"  "GPSM1"         "RAB39B"        "PCNXL4"        "NKIRAS1"       "CTD-2368P22.1" "ZNF496"        "ZNF107"       
[17] "TASP1"         "CTD-3224K15.3" "PSD2"          "ZNF138"  

The genes above are the top 20 loadings for each PC, ranked by absolute value (abs). When I ran the same for the top 500 loadings, I saw quite a bit of overlap between the components, which I don't need for now.

I would like to filter the positive and negative loadings separately for PC1, and likewise for the other PCs, so that I know which genes/features drive the variability across my components. I need these for downstream steps such as enrichment, so it would be easier to segregate them from the beginning.


Upvotes: 1

Views: 84

Answers (2)

TarJae

Reputation: 78927

Update after OP request:

This should be no problem: just remove the select(-value) %>% line and change the last line:


df %>% 
  pivot_longer(-c(gene)) %>% 
  group_split(name) %>% 
  modify(. %>% arrange(-abs(value))) %>% 
  bind_rows() %>% 
  group_by(name) %>% 
  mutate(row = row_number()) %>% 
  pivot_wider(names_from=name, values_from = c(value, gene)) 
    row value_PC1 value_PC2 value_PC3 gene_PC1    gene_PC2    gene_PC3   
  <int>     <dbl>     <dbl>     <dbl> <chr>       <chr>       <chr>      
1     1    0.0977    0.0667   -0.0557 SCML4       IL2RB       RASGRP1    
2     2    0.0964    0.0652   -0.0486 RASGRP1     IKZF3       TIGIT      
3     3    0.0958    0.0593   -0.0481 RP1-47M23.3 RP1-47M23.3 RP1-47M23.3
4     4    0.0956    0.0490   -0.0480 TIGIT       TIGIT       SCML4      
5     5    0.0953    0.0415   -0.0451 IL2RB       SCML4       IKZF3      
6     6    0.0953    0.0150   -0.0438 IKZF3       RASGRP1     IL2RB 

Update: changed modify line: thanks to @Jon Spring (stolen from him):-)

I am still not sure if this is what you are looking for:

But if you want to arrange the genes within each of PC1, PC2 and PC3 by descending absolute value, we could use group_split:

library(dplyr)
library(tidyr)
library(purrr)  # provides modify()

df %>% 
  pivot_longer(-c(gene)) %>% 
  group_split(name) %>% 
  modify(. %>% arrange(-abs(value))) %>% 
  bind_rows() %>% 
  select(-value) %>% 
  group_by(name) %>% 
  mutate(row = row_number()) %>% 
  pivot_wider(names_from=name, values_from = gene) 
    row PC1         PC2         PC3        
  <int> <chr>       <chr>       <chr>      
1     1 SCML4       IL2RB       RASGRP1    
2     2 RASGRP1     IKZF3       TIGIT      
3     3 RP1-47M23.3 RP1-47M23.3 RP1-47M23.3
4     4 TIGIT       TIGIT       SCML4      
5     5 IL2RB       SCML4       IKZF3      
6     6 IKZF3       RASGRP1     IL2RB 
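
The same per-PC ranking can probably be written without group_split() and modify(), by sorting inside each group instead. This is only a sketch of that alternative, not the code used above: the data frame is rebuilt from the question's values rounded to five decimals, and the names ranked_wide, PC, loading and rank are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question, rounded to five decimals (assumption)
df <- tibble(
  gene = c("SCML4", "RASGRP1", "RP1-47M23.3", "TIGIT", "IL2RB", "IKZF3"),
  PC1  = c(0.09770, 0.09637, 0.09584, 0.09558, 0.09532, 0.09526),
  PC2  = c(0.04152, 0.01496, 0.05929, 0.04901, 0.06667, 0.06520),
  PC3  = c(-0.04803, -0.05574, -0.04805, -0.04862, -0.04378, -0.04510)
)

ranked_wide <- df %>%
  # one row per gene/PC pair
  pivot_longer(-gene, names_to = "PC", values_to = "loading") %>%
  group_by(PC) %>%
  # sort by absolute loading within each PC
  arrange(desc(abs(loading)), .by_group = TRUE) %>%
  # position of each gene within its PC
  mutate(rank = row_number()) %>%
  ungroup() %>%
  select(-loading) %>%
  # one column of gene names per PC
  pivot_wider(names_from = PC, values_from = gene)

ranked_wide
```

The .by_group = TRUE argument makes arrange() respect the grouping, so the genes are ordered separately within each PC before they are numbered.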

Upvotes: 2

Jon Spring

Reputation: 66455

If you remove the select line your code does most of what you seem to be describing:

df_select <- df %>% 
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>% 
  group_by(PC) %>% 
  arrange(desc(abs(loading))) %>% 
  slice(1:10) %>%
  ungroup()

Result

# A tibble: 18 × 3
   gene        PC    loading
   <chr>       <chr>   <dbl>
 1 SCML4       PC1    0.0977
 2 RASGRP1     PC1    0.0964
 3 RP1-47M23.3 PC1    0.0958
 4 TIGIT       PC1    0.0956
 5 IL2RB       PC1    0.0953
 6 IKZF3       PC1    0.0953
 7 IL2RB       PC2    0.0667
 8 IKZF3       PC2    0.0652
 9 RP1-47M23.3 PC2    0.0593
10 TIGIT       PC2    0.0490
11 SCML4       PC2    0.0415
12 RASGRP1     PC2    0.0150
13 RASGRP1     PC3   -0.0557
14 TIGIT       PC3   -0.0486
15 RP1-47M23.3 PC3   -0.0481
16 SCML4       PC3   -0.0480
17 IKZF3       PC3   -0.0451
18 IL2RB       PC3   -0.0438

If you want to create objects based on PC:

[code above] %>%
group_split(PC)

then you could get PC1 with df_select[[1]]:

# A tibble: 6 × 3
  gene        PC    loading
  <chr>       <chr>   <dbl>
1 SCML4       PC1    0.0977
2 RASGRP1     PC1    0.0964
3 RP1-47M23.3 PC1    0.0958
4 TIGIT       PC1    0.0956
5 IL2RB       PC1    0.0953
6 IKZF3       PC1    0.0953

or use df_select[[1]]$gene to get

[1] "SCML4"       "RASGRP1"     "RP1-47M23.3" "TIGIT"       "IL2RB"       "IKZF3" 
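
The question also asks to keep positive and negative loadings apart. One way this might be done is to add a sign column and group by it together with PC. The following is a rough sketch under a few assumptions: the data frame is rebuilt from the question's values rounded to five decimals, and top_n_per_sign, direction, loadings_by_sign and gene_sets are hypothetical names chosen for the example.

```r
library(dplyr)
library(tidyr)

# Example data from the question, rounded to five decimals (assumption)
df <- tibble(
  gene = c("SCML4", "RASGRP1", "RP1-47M23.3", "TIGIT", "IL2RB", "IKZF3"),
  PC1  = c(0.09770, 0.09637, 0.09584, 0.09558, 0.09532, 0.09526),
  PC2  = c(0.04152, 0.01496, 0.05929, 0.04901, 0.06667, 0.06520),
  PC3  = c(-0.04803, -0.05574, -0.04805, -0.04862, -0.04378, -0.04510)
)

top_n_per_sign <- 10  # hypothetical cutoff per PC and sign

loadings_by_sign <- df %>%
  pivot_longer(starts_with("PC"), names_to = "PC", values_to = "loading") %>%
  # a loading of exactly zero has no direction, so drop it
  filter(loading != 0) %>%
  mutate(direction = if_else(loading > 0, "positive", "negative")) %>%
  group_by(PC, direction) %>%
  # strongest loadings first, separately for each PC and sign
  slice_max(abs(loading), n = top_n_per_sign, with_ties = FALSE) %>%
  ungroup()

# Named list: one character vector of genes per PC/sign combination
gene_sets <- split(
  loadings_by_sign$gene,
  paste(loadings_by_sign$PC, loadings_by_sign$direction, sep = "_")
)

names(gene_sets)
gene_sets$PC3_negative
```

With the six example genes only three combinations exist (PC1 and PC2 are all positive, PC3 is all negative). On a full loading matrix each PC should normally yield both a positive and a negative gene set, which could then be passed to enrichment separately.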

Upvotes: 3
