Reputation: 351
I have a gene expression count data set with 2 columns and ~60000 rows. Each column is a sample and each row is a gene identified by an Ensembl ID. I need to subset my data to include only genes that are protein coding. Here is a small-scale example of what I would like to achieve:
Here is my data set, called BDC, containing the Ensembl IDs of various types of genes along with the count data for each sample:
ENSEMBL Sample A Sample B
ENSG00000198888 10 2
ENSG00000210082 3 13
ENSG00000198763 6 18
ENSG00000198886 12 11
I also have a list of Ensembl IDs, called ProtCod, containing the genes that I know are protein coding:
ENSEMBL_Protein_Coding
ENSG00000198888
ENSG00000198763
So I want to subset my data set to only include rows that have a protein coding ensembl ID and exclude all other rows:
ENSEMBL Sample A Sample B
ENSG00000198888 10 2
ENSG00000198763 6 18
But I need to do this at full scale, reducing my data set from ~60000 to ~20000 rows (genes).
This is what I've tried so far:
BDCProtCod <- BDC[!row.names(BDC) %in% ProtCod, ]
BDCProtCod
dim(BDCProtCod)
[1] 60675 2
The dimensions are the same as my original BDC data set. Why isn't this code excluding the rows that don't contain names from ProtCod?
I've also tried:
BDCProtCod <- BDC[unlist(ProtCod), ]
BDCProtCod
dim(BDCProtCod)
[1] 19603 2
This actually excludes the rows I want excluded, but it sets every value to NA.
Upvotes: 1
Views: 581
Reputation: 1234
You've deleted your previous post as I was halfway writing the answer.
It seems the ID is stored as the column ENSEMBL in BDC and ENSEMBL_Protein_Coding in ProtCod, so to get them as vectors you should refer to them as BDC$ENSEMBL and ProtCod$ENSEMBL_Protein_Coding respectively:
BDC[BDC$ENSEMBL %in% ProtCod$ENSEMBL_Protein_Coding, ]
ENSEMBL SampleA SampleB
<chr> <dbl> <dbl>
1 ENSG00000198888 10 2
2 ENSG00000198763 6 18
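As for why the two attempts in the question behave the way they do, here is a minimal base R sketch. It assumes BDC carries the default row names ("1", "2", ...) and keeps the IDs in the ENSEMBL column, which is what the example data suggests; the names ids, attempt1, attempt2 and fixed are only for illustration.

```r
# Toy data mirroring the example in the question (base R, no packages needed).
BDC <- data.frame(
  ENSEMBL = c("ENSG00000198888", "ENSG00000210082",
              "ENSG00000198763", "ENSG00000198886"),
  SampleA = c(10, 3, 6, 12),
  SampleB = c(2, 13, 18, 11),
  stringsAsFactors = FALSE
)
ProtCod <- data.frame(
  ENSEMBL_Protein_Coding = c("ENSG00000198888", "ENSG00000198763"),
  stringsAsFactors = FALSE
)
ids <- ProtCod$ENSEMBL_Protein_Coding  # plain character vector of IDs

# Attempt 1: the row names are "1", "2", ... rather than Ensembl IDs,
# so %in% is FALSE for every row and the leading ! turns that into
# "keep every row".
row.names(BDC)
attempt1 <- BDC[!row.names(BDC) %in% ids, ]
dim(attempt1)  # same dimensions as BDC

# Attempt 2: character indexing looks the IDs up among the row names,
# finds none of them, and returns one all-NA row per ID.
attempt2 <- BDC[ids, ]
attempt2

# Matching on the ID column, without the negation, keeps the wanted rows.
fixed <- BDC[BDC$ENSEMBL %in% ids, ]
fixed
```

If the IDs really were the row names of BDC, the second attempt would work once it is given the ID vector, and the first would work once the ! is dropped.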
Data:
require(readr)
BDC = readr::read_table("ENSEMBL SampleA SampleB
ENSG00000198888 10 2
ENSG00000210082 3 13
ENSG00000198763 6 18
ENSG00000198886 12 11")
ProtCod = readr::read_table('ENSEMBL_Protein_Coding
ENSG00000198888
ENSG00000198763')
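Before running the filter on the full ~60000 rows, it may be worth checking how many IDs actually match. The sketch below rebuilds the same toy data in base R; keep, missing_ids and BDCProtCod are illustrative names, not something your data already has.

```r
# Same toy data as above, built without readr.
BDC <- data.frame(
  ENSEMBL = c("ENSG00000198888", "ENSG00000210082",
              "ENSG00000198763", "ENSG00000198886"),
  SampleA = c(10, 3, 6, 12),
  SampleB = c(2, 13, 18, 11),
  stringsAsFactors = FALSE
)
ProtCod <- data.frame(
  ENSEMBL_Protein_Coding = c("ENSG00000198888", "ENSG00000198763"),
  stringsAsFactors = FALSE
)
ids <- ProtCod$ENSEMBL_Protein_Coding

# Logical vector: TRUE where the row's ID is in the protein-coding list.
keep <- BDC$ENSEMBL %in% ids
table(keep)  # how many rows will be kept vs dropped

# Protein-coding IDs that do not appear in BDC at all.
missing_ids <- setdiff(ids, BDC$ENSEMBL)
length(missing_ids)

# The actual subset.
BDCProtCod <- BDC[keep, ]
dim(BDCProtCod)
```

If the number of kept rows comes out far from the length of the ID list, the two ID columns are probably formatted differently and are worth comparing by eye.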
Upvotes: 1