Dswede43

Reputation: 351

How to subset the rows of my data frame based on a list of names?

I have a gene expression count data set with 2 columns and ~60000 rows. Each column is a sample and each row is a gene represented by an Ensembl ID. I need to subset my data to include only genes that are protein coding. Here is a small-scale example of what I would like to achieve:

Here is my data set, called BDC, containing the Ensembl IDs of various types of genes along with the count data for each sample:

ENSEMBL           Sample A    Sample B
ENSG00000198888      10          2
ENSG00000210082      3           13
ENSG00000198763      6           18
ENSG00000198886      12          11

I also have a list of Ensembl IDs, called ProtCod, containing the genes that I know are protein coding:

ENSEMBL_Protein_Coding
ENSG00000198888
ENSG00000198763

I want to subset my data set to include only the rows that have a protein-coding Ensembl ID and exclude all other rows:

ENSEMBL           Sample A    Sample B
ENSG00000198888      10          2
ENSG00000198763      6           18

I need to do this at scale, reducing my data set from ~60000 to ~20000 rows (genes).

This is what I've tried so far:

BDCProtCod <- BDC[!row.names(BDC) %in% ProtCod, ]
BDCProtCod
dim(BDCProtCod)
[1] 60675 2

The dimensions are the same as my original BDC data set. Why isn't this code excluding the rows whose names are not in ProtCod?

I've also tried:

BDCProtCod <- BDC[unlist(ProtCod), ]
BDCProtCod
dim(BDCProtCod)
[1] 19603 2

This does reduce the data set to the number of rows I expect, but every value in the result is NA.

Upvotes: 1

Views: 581

Answers (1)

VitaminB16

Reputation: 1234

You've deleted your previous post as I was halfway writing the answer.

It seems the IDs are stored in the column ENSEMBL in BDC and in the column ENSEMBL_Protein_Coding in ProtCod, so to get them as vectors you should refer to them as BDC$ENSEMBL and ProtCod$ENSEMBL_Protein_Coding respectively. Then keep the rows of BDC whose ID appears in ProtCod:

BDC[BDC$ENSEMBL %in% ProtCod$ENSEMBL_Protein_Coding, ]

  ENSEMBL         SampleA SampleB
  <chr>             <dbl>   <dbl>
1 ENSG00000198888      10       2
2 ENSG00000198763       6      18
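As for why the two attempts in the question behaved the way they did, here is a minimal base R sketch. It assumes that BDC has the default row names ("1", "2", ...) rather than Ensembl IDs and that ProtCod is a one-column data frame; neither is confirmed in the question, so treat this as a likely explanation rather than a diagnosis. The names attempt1, attempt2 and kept are made up for illustration.

```r
# Toy data mirroring the question (base R only).
BDC <- data.frame(
  ENSEMBL = c("ENSG00000198888", "ENSG00000210082",
              "ENSG00000198763", "ENSG00000198886"),
  SampleA = c(10, 3, 6, 12),
  SampleB = c(2, 13, 18, 11)
)
ProtCod <- data.frame(
  ENSEMBL_Protein_Coding = c("ENSG00000198888", "ENSG00000198763")
)

# Attempt 1: the row names are "1".."4" and ProtCod is a whole data frame,
# so %in% finds no matches; the leading `!` then flips every FALSE to TRUE
# and all rows are kept.
row.names(BDC)
attempt1 <- BDC[!row.names(BDC) %in% ProtCod, ]
nrow(attempt1)

# Attempt 2: indexing by row names that do not exist returns one
# all-NA row per requested name.
attempt2 <- BDC[unlist(ProtCod), ]
nrow(attempt2)

# Comparing the two ID columns directly keeps only the protein-coding genes.
kept <- BDC[BDC$ENSEMBL %in% ProtCod$ENSEMBL_Protein_Coding, ]
kept
```

Note also that the `!` in the first attempt asks for the rows that are not in the list, which is the opposite of what you want even once the IDs match.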

Data:

library(readr)  # provides read_table()
BDC = readr::read_table("ENSEMBL           SampleA    SampleB
ENSG00000198888      10          2
ENSG00000210082      3           13
ENSG00000198763      6           18
ENSG00000198886      12          11")
ProtCod = readr::read_table('ENSEMBL_Protein_Coding
ENSG00000198888
ENSG00000198763')
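If you would rather index by row name, as both attempts in the question do, the IDs have to be the row names first. A rough sketch with a plain data.frame follows (tibbles, which readr returns, do not support row names, so this assumes a base data frame); protein_ids, present and by_name are hypothetical names used only for this example.

```r
# Toy data as a base data.frame.
BDC <- data.frame(
  ENSEMBL = c("ENSG00000198888", "ENSG00000210082",
              "ENSG00000198763", "ENSG00000198886"),
  SampleA = c(10, 3, 6, 12),
  SampleB = c(2, 13, 18, 11)
)
protein_ids <- c("ENSG00000198888", "ENSG00000198763")

# Move the IDs into the row names and drop the now-redundant column.
rownames(BDC) <- BDC$ENSEMBL
BDC$ENSEMBL <- NULL

# intersect() drops any ID that is absent from BDC; requesting a missing
# row name would otherwise produce an all-NA row.
present <- intersect(protein_ids, rownames(BDC))
by_name <- BDC[present, ]
by_name
```

This only works if the IDs are unique, since duplicated row names are not allowed in a data frame.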

Upvotes: 1
