Subsetting data frame by single words occurring in column name

Question

I'm new to Stack Overflow and to R in general, so I hope I don't violate any etiquette :)

So I have quite a big data frame of gene expression levels called expression and I would like to define subsets based on words that occur in the column names.

gene.adk1 gene.adk2 gene.adk3 gene.bas1 gene.bas2   etc
1         2         1         4         6

This is just a small example version of the data frame. What I want to do is define one subset containing only the columns that have "adk" in their name and another subset containing only the columns that have "bas" in their name.

What I did was to sort the column names alphabetically and look at my data frame to find out how many columns there are containing "adk" in their title. I then defined the subset by using the subset function:

adk <- subset.data.frame(expression, select = c(1:3))

Is there a more elegant way of doing this? Maybe by defining subsets using single words like "adk" in the column name?

Thanks in advance

Marius

akrun · Accepted Answer

We can use grep to match the substring 'adk' or 'bas' in the column names and select those columns:

adkexprs <- expression[grep('adk', names(expression))]
basexprs <- expression[grep('bas', names(expression))]
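To try this without the full data set, here is a minimal sketch that rebuilds the toy data frame from the question. The values are just the example row shown there, and `adk_idx` is a name introduced only for illustration.

```r
# Toy version of the data frame from the question (one row of example values).
expression <- data.frame(
  gene.adk1 = 1, gene.adk2 = 2, gene.adk3 = 1,
  gene.bas1 = 4, gene.bas2 = 6
)

# grep() returns the positions of the column names that contain "adk".
adk_idx <- grep("adk", names(expression))

# Single-bracket indexing with those positions keeps the result a data frame.
adkexprs <- expression[adk_idx]
basexprs <- expression[grep("bas", names(expression))]

names(adkexprs)  # "gene.adk1" "gene.adk2" "gene.adk3"
names(basexprs)  # "gene.bas1" "gene.bas2"
```

Because the columns are found by name, this keeps working if the column order changes or more "adk" columns are added later, unlike the hard-coded 1:3.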

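To make the match more exact, we can anchor the pattern to the whole column name (note that the backslashes have to be doubled inside an R string):

adkexprs <- expression[grep('^gene\\.adk\\d+$', names(expression))]
basexprs <- expression[grep('^gene\\.bas\\d+$', names(expression))]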
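To see what the stricter pattern buys us, here is a small sketch comparing it with the plain substring match. The columns `geneXadk9` and `gene.adk1.norm` are hypothetical names added only to show the difference; they are not from the question. In an R string literal each regex backslash is written twice, so the pattern reads `\\.` and `\\d`.

```r
# Toy data with two hypothetical columns that contain "adk"
# but do not follow the gene.adk<number> naming scheme.
expression <- data.frame(
  gene.adk1 = 1, gene.adk2 = 2,
  gene.bas1 = 4,
  geneXadk9 = 0,        # hypothetical: no literal dot after "gene"
  gene.adk1.norm = 3    # hypothetical: extra suffix after the number
)

# value = TRUE returns the matching names instead of their positions.
loose  <- grep("adk", names(expression), value = TRUE)
strict <- grep("^gene\\.adk\\d+$", names(expression), value = TRUE)

loose   # "gene.adk1" "gene.adk2" "geneXadk9" "gene.adk1.norm"
strict  # "gene.adk1" "gene.adk2"
```

Whether the loose or the strict version is the right one depends on how regular the real column names are.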

grep returns the integer positions of the matching names, while grepl returns a logical vector with one TRUE/FALSE per name. For selecting columns, both give the same result:

adkexprs <- expression[grepl('adk', names(expression))]
basexprs <- expression[grepl('bas', names(expression))]
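A quick sketch on toy data to check that the two forms select the same columns. It also shows one practical case where the return type matters, namely dropping columns when nothing matches; the pattern "xyz" is a made-up example that matches no column.

```r
expression <- data.frame(gene.adk1 = 1, gene.adk2 = 2, gene.bas1 = 4)

# Selecting: integer positions and a logical mask pick the same columns.
by_index <- expression[grep("adk", names(expression))]
by_mask  <- expression[grepl("adk", names(expression))]
identical(names(by_index), names(by_mask))  # TRUE

# Dropping: with no match, grep() returns integer(0), and negating an
# empty index selects nothing, so every column is lost.
drop_index <- expression[-grep("xyz", names(expression))]
# Negating the logical mask keeps every column, as intended.
drop_mask  <- expression[!grepl("xyz", names(expression))]

ncol(drop_index)  # 0
ncol(drop_mask)   # 3
```

So for plain selection either function is fine, but for excluding columns `!grepl(...)` is the safer habit.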

Or with select from dplyr

library(dplyr)
adkexprs <- expression %>%
      select(matches('adk'))

basexprs <- expression %>%
      select(matches('bas'))
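As a self-contained sketch of the dplyr route (assuming dplyr is installed, and again using the toy values from the question): `matches()` treats its argument as a regular expression, while `contains()` looks for a literal substring, so either can be used depending on how strict the match needs to be.

```r
library(dplyr)

expression <- data.frame(
  gene.adk1 = 1, gene.adk2 = 2, gene.adk3 = 1,
  gene.bas1 = 4, gene.bas2 = 6
)

# matches() takes a regular expression, so the anchored pattern works here too.
adkexprs <- expression %>%
  select(matches("^gene\\.adk\\d+$"))

# contains() matches a literal substring, with no regex interpretation.
basexprs <- expression %>%
  select(contains("bas"))

names(adkexprs)  # "gene.adk1" "gene.adk2" "gene.adk3"
names(basexprs)  # "gene.bas1" "gene.bas2"
```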
