Reputation: 968
My data frame has genes in the first column and their loadings for each principal component (PC1, PC2, PC3) in the remaining columns. The loadings can be positive or negative, and I would like to filter the genes based on positive and negative values.
A small subset of my data frame:
structure(list(gene = c("SCML4", "RASGRP1", "RP1-47M23.3", "TIGIT",
"IL2RB", "IKZF3"), PC1 = c(0.0976999752508752, 0.0963683648774497,
0.0958379291214584, 0.095581364305455, 0.0953187100695565, 0.0952640683198088
), PC2 = c(0.0415177491122262, 0.0149616407858333, 0.0592932173696311,
0.0490135176285661, 0.0666662088855938, 0.0652039968982664),
PC3 = c(-0.0480347151614553, -0.05574053153725, -0.04805364872616,
-0.0486181477818392, -0.0437832673958965, -0.0450981246281503
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Data
gene PC1 PC2 PC3
<chr> <dbl> <dbl> <dbl>
1 SCML4 0.0977 0.0415 -0.0480
2 RASGRP1 0.0964 0.0150 -0.0557
3 RP1-47M23.3 0.0958 0.0593 -0.0481
4 TIGIT 0.0956 0.0490 -0.0486
5 IL2RB 0.0953 0.0667 -0.0438
6 IKZF3 0.0953 0.0652 -0.0451
I found a way to filter these:
top_genes <- df %>%
  # select only the PCs we are interested in
  select(gene, PC3) %>%
  # convert to a "long" format
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>%
  # for each PC
  group_by(PC) %>%
  # arrange by descending order of absolute loading
  arrange(desc(abs(loading))) %>%
  # take the 10 top rows
  slice(1:10) %>%
  # pull the gene column as a vector
  pull(gene) %>%
  # ensure only unique genes are retained
  unique()
top_genes
To get the top genes from PC1, PC2, or PC3 individually, which is how I want to segregate them, I currently have to rerun the code with select(gene, PC1), select(gene, PC2), or select(gene, PC3).
How can I get the top genes for each PC column separately in one go, instead of selecting each PC one by one?
Here are the results from one of my larger datasets, which has more rows.
For PC1 I get this
[1] "ZNF550" "TBC1D32" "CCDC171" "ZNF493" "ZNF749" "AC004076.5" "TEX10" "ZNF573"
[9] "ZNF610" "ZNF891" "FAM179B" "ZNF551" "ZNF84" "ZNF549" "CTC-559E9.8" "RP11-242D8.1"
[17] "STX18-AS1" "ZNF571" "ZSCAN30" "RP11-156P1.3"
For PC2 I get this
[1] "TNRC6B" "PARP11" "FBXL4" "PPT2" "RPL7P11" "EMID1" "AKAP5" "RPL7P52"
[9] "AC003104.1" "ZMYM3" "SPNS2" "EGFL8" "SENP7" "RPL7P7" "ZNF808" "RP11-358B23.5"
[17] "CTC-444N24.7" "SOS1" "ARHGAP23" "FAM227A"
For PC3 I get this
[1] "AC090945.1" "TSC22D1-AS1" "SV2A" "TSC22D1" "MIR133A1HG" "ZNF792" "ATP8B2" "PIBF1"
[9] "RP3-415N12.1" "GPSM1" "RAB39B" "PCNXL4" "NKIRAS1" "CTD-2368P22.1" "ZNF496" "ZNF107"
[17] "TASP1" "CTD-3224K15.3" "PSD2" "ZNF138"
The genes above are the top 20 loadings for each PC, ranked by absolute value using the abs function. When I ran the same for the top 500 loadings, I saw quite a bit of overlap between my components because I am taking absolute loadings. I don't need that overlap for now.
I would like to filter the positive and negative loadings of PC1 separately, and the same for the other PCs, so that I know which genes/features drive the variability across my components. I need these for downstream steps such as enrichment, so it would be easier to segregate them from the beginning.
Upvotes: 1
Views: 84
Reputation: 78927
Update after OP request:
This should be no problem: just remove the select(-value) %>% line and change the last line:
library(dplyr)
library(tidyr)
library(purrr) # provides modify()
df %>%
  pivot_longer(-c(gene)) %>%
  group_split(name) %>%
  modify(. %>% arrange(-abs(value))) %>%
  bind_rows() %>%
  group_by(name) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = name, values_from = c(value, gene))
row value_PC1 value_PC2 value_PC3 gene_PC1 gene_PC2 gene_PC3
<int> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 1 0.0977 0.0667 -0.0557 SCML4 IL2RB RASGRP1
2 2 0.0964 0.0652 -0.0486 RASGRP1 IKZF3 TIGIT
3 3 0.0958 0.0593 -0.0481 RP1-47M23.3 RP1-47M23.3 RP1-47M23.3
4 4 0.0956 0.0490 -0.0480 TIGIT TIGIT SCML4
5 5 0.0953 0.0415 -0.0451 IL2RB SCML4 IKZF3
6 6 0.0953 0.0150 -0.0438 IKZF3 RASGRP1 IL2RB
Update: changed modify line: thanks to @Jon Spring (stolen from him):-)
I am still not sure if this is what you are looking for, but if you want to sort each column so that the genes are ordered by the absolute value of their loading within PC1, PC2, and PC3, we could use group_split:
library(dplyr)
library(tidyr)
library(purrr) # provides modify()
df %>%
  pivot_longer(-c(gene)) %>%
  # one tibble per PC
  group_split(name) %>%
  # sort each tibble by descending absolute loading
  modify(. %>% arrange(-abs(value))) %>%
  bind_rows() %>%
  select(-value) %>%
  group_by(name) %>%
  # rank within each PC, used as the row key when widening
  mutate(row = row_number()) %>%
  pivot_wider(names_from = name, values_from = gene)
row PC1 PC2 PC3
<int> <chr> <chr> <chr>
1 1 SCML4 IL2RB RASGRP1
2 2 RASGRP1 IKZF3 TIGIT
3 3 RP1-47M23.3 RP1-47M23.3 RP1-47M23.3
4 4 TIGIT TIGIT SCML4
5 5 IL2RB SCML4 IKZF3
6 6 IKZF3 RASGRP1 IL2RB
Upvotes: 2
Reputation: 66455
If you remove the select line, your code does most of what you seem to be describing:
df_select <- df %>%
pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>%
group_by(PC) %>%
arrange(desc(abs(loading))) %>%
slice(1:10) %>%
ungroup()
Result
# A tibble: 18 × 3
gene PC loading
<chr> <chr> <dbl>
1 SCML4 PC1 0.0977
2 RASGRP1 PC1 0.0964
3 RP1-47M23.3 PC1 0.0958
4 TIGIT PC1 0.0956
5 IL2RB PC1 0.0953
6 IKZF3 PC1 0.0953
7 IL2RB PC2 0.0667
8 IKZF3 PC2 0.0652
9 RP1-47M23.3 PC2 0.0593
10 TIGIT PC2 0.0490
11 SCML4 PC2 0.0415
12 RASGRP1 PC2 0.0150
13 RASGRP1 PC3 -0.0557
14 TIGIT PC3 -0.0486
15 RP1-47M23.3 PC3 -0.0481
16 SCML4 PC3 -0.0480
17 IKZF3 PC3 -0.0451
18 IL2RB PC3 -0.0438
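The question also asks to keep positive and negative loadings apart within each PC. One possible extension of the code above, shown only as a sketch: label each loading by its sign and take the strongest loadings per PC and direction. The names n_top, direction, top_by_sign and genes_by_group are made up for this illustration, the data are the question's sample values rounded to four decimals, and slice_max() needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rounded to four decimals
df <- data.frame(
  gene = c("SCML4", "RASGRP1", "RP1-47M23.3", "TIGIT", "IL2RB", "IKZF3"),
  PC1  = c(0.0977, 0.0964, 0.0958, 0.0956, 0.0953, 0.0953),
  PC2  = c(0.0415, 0.0150, 0.0593, 0.0490, 0.0667, 0.0652),
  PC3  = c(-0.0480, -0.0557, -0.0481, -0.0486, -0.0438, -0.0451),
  stringsAsFactors = FALSE
)

n_top <- 3  # hypothetical cutoff; the question uses 10, 20 or 500

top_by_sign <- df %>%
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>%
  # drop exact zeros, then label each loading by its sign
  filter(loading != 0) %>%
  mutate(direction = if_else(loading > 0, "positive", "negative")) %>%
  group_by(PC, direction) %>%
  # keep the n_top strongest loadings within each PC and direction
  slice_max(abs(loading), n = n_top, with_ties = FALSE) %>%
  ungroup()

# Named list of gene vectors, e.g. genes_by_group[["PC3.negative"]]
genes_by_group <- split(
  top_by_sign$gene,
  list(top_by_sign$PC, top_by_sign$direction),
  drop = TRUE
)
```

With this sample, PC1 and PC2 have only positive loadings and PC3 only negative ones, so three groups come back. On a full loading matrix each PC would normally yield both a positive and a negative gene set.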
If you want to create objects based on PC:
[code above] %>%
group_split(PC)
then you could get PC1 with df_select[[1]]:
# A tibble: 6 × 3
gene PC loading
<chr> <chr> <dbl>
1 SCML4 PC1 0.0977
2 RASGRP1 PC1 0.0964
3 RP1-47M23.3 PC1 0.0958
4 TIGIT PC1 0.0956
5 IL2RB PC1 0.0953
6 IKZF3 PC1 0.0953
or use df_select[[1]]$gene to get
[1] "SCML4" "RASGRP1" "RP1-47M23.3" "TIGIT" "IL2RB" "IKZF3"
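group_split() returns an unnamed list, so the pieces have to be picked out by position. If access by PC name is more convenient, a rough alternative is base R's split(), which names each element after its group. This is a sketch on the question's sample data rounded to four decimals, and df_long and genes_by_pc are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rounded to four decimals
df <- data.frame(
  gene = c("SCML4", "RASGRP1", "RP1-47M23.3", "TIGIT", "IL2RB", "IKZF3"),
  PC1  = c(0.0977, 0.0964, 0.0958, 0.0956, 0.0953, 0.0953),
  PC2  = c(0.0415, 0.0150, 0.0593, 0.0490, 0.0667, 0.0652),
  PC3  = c(-0.0480, -0.0557, -0.0481, -0.0486, -0.0438, -0.0451),
  stringsAsFactors = FALSE
)

df_long <- df %>%
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>%
  # order by PC, then by descending absolute loading
  arrange(PC, desc(abs(loading)))

# Named list: one character vector of genes per PC, in ranked order
genes_by_pc <- split(df_long$gene, df_long$PC)

genes_by_pc$PC3
```

genes_by_pc$PC1, genes_by_pc$PC2 and genes_by_pc$PC3 then each hold one ranked gene vector, which can be passed on to downstream steps without tracking list positions.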
Upvotes: 3