Reputation: 31
I have been having headaches about this for a while now and I am hoping you can help me out.
I am analysing scRNAseq datasets from mouse kidneys and got Differential Expression data from pseudobulking analyses (DESeq2 package) comparing sick vs healthy mice. To make more sense of the genes, I would like to do a gene set enrichment analysis. I looked up a few tutorials and I believe that I got it to work the way I want to. However, I am a little surprised about the output I get, i.e. I always get a handful (if any) enriched gene sets. I understand that this might be just the way it is, but when I look at these genes with my biological knowledge about these genes, I see clear trends that I expected to show up in the GSEA.
For example, I have an endothelial cell cluster in which I compared sick vs healthy animals and the DE table clearly shows upregulation of genes that are involved with inflammatory responses and leukocyte recruitment. A lot of these pathways do show up in my GSEA results, but they are far from being significant (looking at fdr < 0.05).
Since I am very new to scRNAseq analyses, I am unfamiliar with the standards / common practices in these applications. My research question is which differences I see between sick and healthy mice in different celltypes. So I assumed that it would be reasonable to use only the significantly enriched DE genes for my GSEA and filter the results of my GSEA for only significantly enriched gene sets as well. I am wondering whether my filtering thresholds are too strict and whether it is okay to losen them. I also wondered whether this might be an artifact of my pseudobulking approach. The group of sick mice only have n = 2 so, maybe my DE analysis is too restrictive because of the low number of n. Would it be possible / okay to create 2 pseudoreplicates from one mouse to avoid this and have a compromise between only having 2n and running into the issue of too many pseudoreplicates when looking at each individual cell as a replicate (as Seurat's FindMarkers function does)?.
I would really appreciate your help on this!
Here is the code I am using:
# Download datasets and format
GO <- msigdbr(species = "Mus musculus", category = "C5", subcategory = "GO:BP")
H <- msigdbr(species = "Mus musculus", category = "H")
KEGG <- msigdbr(species = "Mus musculus", category = "C2", subcategory = "CP:KEGG")
global_df <- rbind(GO, H, KEGG)
filtered_df <- global_df %>%
dplyr::select(gs_name, gene_symbol) %>%
# create a single row for each term / name that lists all of the gene ids in a vector
group_by(gs_name)
count <- filtered_df %>% dplyr::count(gs_name) %>% .[.$n > 15 & .$n < 500,]
filtered_gs <- filtered_df %>% subset(.$gs_name %in% count$gs_name)
filtered_gs <- filtered_gs %>%
summarise(all.genes = list(unique(gene_symbol))) %>%
deframe()
#load pseudobulking DE files
files_to_load <- list.files("/home/user/scRNAseq/pdss2/mito_threshold_1/Pod", pattern = "weeks.csv", full.names = TRUE)
analysis_list <- lapply(files_to_load, function(x){
read.csv(x, header = TRUE)
})
names(analysis_list) <- c("12", "21", "6")
# subset for age of interest
data <- analysis_list[["12"]]
# filter for significantly DE genes
data <- data %>% subset(padj < 0.05)
# get named vector of genes
gsea_data <- data.frame(data$gene, data$log2FoldChange)
colnames(gsea_data) <- c("gene", "log2FC")
# format for GSEA (needs a named vector; values = FoldChange, names = genes)
gsea_data_vec <- gsea_data$log2FC
names(gsea_data_vec) <- gsea_data$gene
# set ScoreType ("std" if neg and pos values present; "pos" if only pos; "neg" if only neg)
ScoreType <- if(min(gsea_data_vec) < 0 & max(gsea_data_vec) > 0){
print("std")
} else if(min(gsea_data_vec) < 0 & max(gsea_data_vec) < 0){
print("neg")
} else{
print("pos")
}
# run fGSEA
gsea_out <- fgsea(filtered_gs,
stats = gsea_data_vec,
scoreType = ScoreType,
minSize = 15,
nPermSimple = 10000)`
Here is my sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_3.56.2 clusterProfiler_4.8.1 org.Mm.eg.db_3.17.0 AnnotationDbi_1.62.2 data.table_1.14.8
[6] fgsea_1.26.0 writexl_1.4.2 DESeq2_1.40.2 celldex_1.10.1 scCustomize_1.1.1
[11] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
[16] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
[21] SeuratObject_4.1.3 Seurat_4.3.0.1 msigdbr_7.5.1 SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
[26] Biobase_2.60.0 GenomicRanges_1.52.0 GenomeInfoDb_1.36.1 IRanges_2.34.1 S4Vectors_0.38.1
[31] BiocGenerics_0.46.0 MatrixGenerics_1.12.2 matrixStats_1.0.0
loaded via a namespace (and not attached):
[1] reactable_0.4.4 spatstat.sparse_3.0-2 bitops_1.0-7 enrichplot_1.20.0 HDO.db_0.99.1
[6] httr_1.4.6 webshot_0.5.5 RColorBrewer_1.1-3 tools_4.3.0 sctransform_0.3.5
[11] utf8_1.2.3 R6_2.5.1 lazyeval_0.2.2 uwot_0.1.16 withr_2.5.0
[16] sp_2.0-0 gridExtra_2.3 progressr_0.13.0 cli_3.6.1 spatstat.explore_3.2-1
[21] scatterpie_0.2.1 labeling_0.4.2 spatstat.data_3.0-1 ggridges_0.5.4 pbapply_1.7-2
[26] systemfonts_1.0.4 yulab.utils_0.0.6 gson_0.1.0 DOSE_3.26.1 svglite_2.1.1
[31] parallelly_1.36.0 readxl_1.4.2 rstudioapi_0.14 RSQLite_2.3.1 shape_1.4.6
[36] visNetwork_2.1.2 generics_0.1.3 gridGraphics_0.5-1 ica_1.0-3 spatstat.random_3.1-5
[41] zip_2.3.0 GO.db_3.17.0 Matrix_1.5-4.1 ggbeeswarm_0.7.2 fansi_1.0.4
[46] abind_1.4-5 lifecycle_1.0.3 yaml_2.3.7 snakecase_0.11.0 BiocFileCache_2.8.0
[51] qvalue_2.32.0 Rtsne_0.16 paletteer_1.5.0 grid_4.3.0 blob_1.2.4
[56] promises_1.2.0.1 ExperimentHub_2.8.0 crayon_1.5.2 miniUI_0.1.1.1 lattice_0.21-8
[61] cowplot_1.1.1 KEGGREST_1.40.0 pillar_1.9.0 knitr_1.43 future.apply_1.11.0
[66] codetools_0.2-19 fastmatch_1.1-3 leiden_0.4.3 glue_1.6.2 downloader_0.4
[71] ggfun_0.1.1 vctrs_0.6.3 png_0.1-8 treeio_1.24.1 cellranger_1.1.0
[76] gtable_0.3.3 rematch2_2.1.2 cachem_1.0.8 xfun_0.39 openxlsx_4.2.5.2
[81] S4Arrays_1.0.4 mime_0.12 tidygraph_1.2.3 survival_3.5-5 interactiveDisplayBase_1.38.0
[86] ellipsis_0.3.2 fitdistrplus_1.1-11 ROCR_1.0-11 nlme_3.1-162 ggtree_3.8.0
[91] bit64_4.0.5 filelock_1.0.2 RcppAnnoy_0.0.21 irlba_2.3.5.1 vipor_0.4.5
[96] KernSmooth_2.23-21 colorspace_2.1-0 DBI_1.1.3 ggrastr_1.0.2 tidyselect_1.2.0
[101] curl_5.0.1 bit_4.0.5 compiler_4.3.0 rvest_1.0.3 xml2_1.3.4
[106] DelayedArray_0.26.6 plotly_4.10.2 shadowtext_0.1.2 scales_1.2.1 lmtest_0.9-40
[111] rappdirs_0.3.3 digest_0.6.32 goftest_1.2-3 spatstat.utils_3.0-3 rmarkdown_2.23
[116] XVector_0.40.0 htmltools_0.5.5 pkgconfig_2.0.3 sparseMatrixStats_1.12.2 dbplyr_2.3.2
[121] fastmap_1.1.1 GlobalOptions_0.1.2 rlang_1.1.1 htmlwidgets_1.6.2 DelayedMatrixStats_1.22.1
[126] shiny_1.7.4 farver_2.1.1 zoo_1.8-12 jsonlite_1.8.7 BiocParallel_1.34.2
[131] GOSemSim_2.26.0 RCurl_1.98-1.12 magrittr_2.0.3 kableExtra_1.3.4 GenomeInfoDbData_1.2.10
[136] ggplotify_0.1.1 patchwork_1.1.2 munsell_0.5.0 Rcpp_1.0.10 ape_5.7-1
[141] babelgene_22.9 viridis_0.6.3 reticulate_1.30 stringi_1.7.12 ggraph_2.1.0
[146] zlibbioc_1.46.0 MASS_7.3-60 org.Hs.eg.db_3.17.0 AnnotationHub_3.8.0 plyr_1.8.8
[151] parallel_4.3.0 listenv_0.9.0 ggrepel_0.9.3 deldir_1.0-9 Biostrings_2.68.1
[156] graphlayouts_1.0.0 splines_4.3.0 tensor_1.5 circlize_0.4.15 hms_1.1.3
[161] locfit_1.5-9.8 igraph_1.5.0 spatstat.geom_3.2-1 reshape2_1.4.4 BiocVersion_3.17.1
[166] evaluate_0.21 BiocManager_1.30.21 ggprism_1.0.4 tzdb_0.4.0 tweenr_2.0.2
[171] httpuv_1.6.11 RANN_2.6.1 polyclip_1.10-4 future_1.33.0 scattermore_1.2
[176] ggforce_0.4.1 janitor_2.2.0 xtable_1.8-4 tidytree_0.4.2 later_1.3.1
[181] viridisLite_0.4.2 hypeR_2.0.1 aplot_0.1.10 memoise_2.0.1 beeswarm_0.4.0
[186] cluster_2.1.4 timechange_0.2.0 globals_0.16.2
I tried the following things:
Upvotes: 1
Views: 476
Reputation: 56
With pseudobulk you might have a hard time finding Differentially Expressed Genes, DEG, for GSEA as their p-adjusted value will be rather low.
Perhaps you can use single cell methods to obtain the DEG list. You can do that using the function FindAllMarkers or FindMarkers in Seurat.
This is a walkthrough that uses fgsea https://pnnl-comp-mass-spec.github.io/proteomics-data-analysis-tutorial/gsea.html I am not sure if it is common to filter genes based on adjusted p value before running fgsea, maybe the guide can help.
Upvotes: 0