nat telvyor

Reputation: 3

How do I subset a data frame based on the values in another data frame?

I have a data frame whose columns represent patients of various ages (each column name has the form ID.age.gender), and another data frame that lists those ages. I want to subset the data so that only patients under the age of 50 are kept.

> dat
             GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M
31307_at       179.86300    106.495000     265.58600     301.24300     218.50900     224.61000
31308_at       559.07800    411.483000     481.17600     570.73300     333.53900     370.07900
31309_r_at      20.76970     30.641500      50.21530      42.68920      27.10590      21.57620
31310_at       154.19100    224.446000     188.82300     177.86300     233.46300     120.90800
31311_at       956.79700    648.310000     933.65600    1016.41000     762.01300    1040.29000

And this is the annotation file with the ages of the patients:

> ann
          Gender Age
GSM27015      M  26
GSM27016      M  26
GSM27018      M  29
GSM27021      M  37
GSM27023      M  40
GSM27024      M  42
GSM27025      M  45
GSM27027      M  52
GSM27028      M  53

Upvotes: 0

Views: 74

Answers (4)

akrun

Reputation: 886938

An option with parse_number

library(stringr)
dat[readr::parse_number(str_remove(names(dat), "^[^.]+\\.")) < 50]

Upvotes: 0

Ben

Reputation: 30474

Here's something else to consider.

You could transpose your data so that patients are rows rather than columns. Since the column names appear to encode age and gender, you can split those out into columns of their own.

dat_new <- cbind(do.call(rbind, strsplit(colnames(dat), '\\.')), as.data.frame(t(dat)))
colnames(dat_new)[1:3] <- c("id", "age", "gender")
rownames(dat_new) <- NULL

This is what it would look like:

        id age gender 31307_at 31308_at 31309_r_at 31310_at 31311_at
1 GSM27015  26      M  179.863  559.078    20.7697  154.191  956.797
2 GSM27016  26      M  106.495  411.483    30.6415  224.446  648.310
3 GSM27018  29      M  265.586  481.176    50.2153  188.823  933.656
4 GSM27021  37      M  301.243  570.733    42.6892  177.863 1016.410
5 GSM27023  40      M  218.509  333.539    27.1059  233.463  762.013
6 GSM27024  42      M  224.610  370.079    21.5762  120.908 1040.290

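Then, if you wish to subset based on age (e.g., <= 50 years; use < for strictly under 50, as the question asks), convert age to numeric first, because strsplit() returns it as text and a text comparison against 50 would not be numeric:

dat_new[as.numeric(as.character(dat_new$age)) <= 50, ]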

Upvotes: 1

ThomasIsCoding

Reputation: 101024

Perhaps try

dat[as.numeric(gsub(".*?\\.(\\d+)\\..*","\\1",names(dat)))<50]
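
For readers who find the regex hard to follow, the same idea can be spelled out step by step in base R. This is only a sketch on a small toy data frame: the third column and all of its values are made up for illustration, and it assumes every column name follows the ID.age.gender layout shown in the question.

```r
# Toy data with the same column-name layout as the question: <id>.<age>.<gender>
dat <- data.frame(
  GSM27015.26.M = c(179.863, 559.078),
  GSM27024.42.M = c(224.610, 370.079),
  GSM27027.52.M = c(100, 200),  # hypothetical column, added to show exclusion
  row.names = c("31307_at", "31308_at")
)

# Split each column name on the literal dot and take the second field as the age
parts <- strsplit(names(dat), ".", fixed = TRUE)
age <- as.numeric(vapply(parts, `[`, character(1), 2))

# Keep only the columns whose age is below 50; drop = FALSE keeps a data frame
# even if a single column survives
young <- dat[, age < 50, drop = FALSE]
names(young)
```

Splitting with fixed = TRUE avoids escaping the dot, and it does not depend on the ID having a particular number of digits.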

Upvotes: 0

Karthik S

Reputation: 11584

Does this work:

> library(dplyr)
> data
           GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M GSM27024.52.M
31307_at        179.8630      106.4950      265.5860      301.2430      218.5090      224.6100       331.230
31308_at        559.0780      411.4830      481.1760      570.7330      333.5390      370.0790       370.079
31309_r_at       20.7697       30.6415       50.2153       42.6892       27.1059       21.5762     98998.000
31310_at        154.1910      224.4460      188.8230      177.8630      233.4630      120.9080       120.908
31311_at        956.7970      648.3100      933.6560     1016.4100      762.0130     1040.2900      1000.290
> data %>% select_if(as.numeric(gsub('GSM\\d{5}\\.(\\d{2})..','\\1',names(data))) < 50)
           GSM27015.26.M GSM27016.26.M GSM27018.29.M GSM27021.37.M GSM27023.40.M GSM27024.42.M
31307_at        179.8630      106.4950      265.5860      301.2430      218.5090      224.6100
31308_at        559.0780      411.4830      481.1760      570.7330      333.5390      370.0790
31309_r_at       20.7697       30.6415       50.2153       42.6892       27.1059       21.5762
31310_at        154.1910      224.4460      188.8230      177.8630      233.4630      120.9080
31311_at        956.7970      648.3100      933.6560     1016.4100      762.0130     1040.2900
> 

I added one more column, "GSM27024.52.M" (age 52), to your data, and as expected it is not included in the select output.
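
The approaches in this thread read the age from the column names. If the ages should instead come from the annotation data frame, as the question title suggests, one possible sketch is to match the ID part of each column name against the row names of ann. The toy data below is made up for illustration, and the code assumes the patient ID is everything before the first dot.

```r
# Toy versions of dat and ann (values are illustrative, not the real data)
dat <- data.frame(
  GSM27015.26.M = c(179.863, 559.078),
  GSM27024.42.M = c(224.610, 370.079),
  GSM27027.52.M = c(100, 200),  # hypothetical column for a patient aged 52
  row.names = c("31307_at", "31308_at")
)
ann <- data.frame(
  Gender = c("M", "M", "M"),
  Age = c(26, 42, 52),
  row.names = c("GSM27015", "GSM27024", "GSM27027")
)

# Patient ID is everything before the first dot in each column name
ids <- sub("\\..*$", "", names(dat))

# Look up each column's age in ann by row name; unmatched IDs give NA
age <- ann$Age[match(ids, rownames(ann))]

# Keep columns with a known age below 50
keep <- !is.na(age) & age < 50
res <- dat[, keep, drop = FALSE]
names(res)
```

This keeps working if the column names ever stop carrying the age, at the cost of silently dropping any column whose ID is missing from ann.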

Upvotes: 0
