Sebastian Zeki

Reputation: 6874

Filter rows in one dataframe based on columns in another dataframe in R

I have a dataframe as follows called df

OC_ID           Bcode  Bcode_full Cell   Ploidy Goodness_of_fit
OC_DD_0181     LP2000  LP2000-A    56      3      0.45
OC_AD_9787     LP2003  LP2003-B    3       3      0.44
OC_GH_6227     LP2333  LP2333-S    66      3      0.89

I have another dataframe as follows called df2:

chr   leftPos    Tumour2_OCLLL_DD_0181_SLH.9396.fq.gz    Tumour2_OCLLL_DD_09787_SLH.9396.fq.gz   Tumour3_OCLLL_GH_6227_SLH.9396.fq.gz     Tumour4_OCLLL_GH_6632_SLH.9396.fq.gz   Tumour5_OCLLL_WH_6992_SLH.9396.fq.gz
chr1    720916  13.4031903  28.4522464  10.34087    23.4309208  16.239874
chr1    736092  3.4367155   36.7797331  6.893913    58.5773021  59.546204
chr1    818159  108.9438802 109.6452421 78.131014   90.2779596  108.265825
chr1    4105086 114.4426249 103.7466057 59.747246   48.9292758  129.91899
chr1    4140849 23.7133367  0.6939572   45.95942    53.0641442  37.893039

The column names in df2 do not correspond exactly to the values in the OC_ID column of df. I want to create a dataframe that contains only those columns of df2 for which the value in the Cell column of df is >30, so that the expected output is:

chr   leftPos    OC_DD_0181  OC_GH_6227
chr1    720916  13.4031903  10.34087
chr1    736092  3.4367155   6.893913
chr1    818159  108.9438802 78.131014
chr1    4105086 114.4426249 59.747246
chr1    4140849 23.7133367  45.95942

Upvotes: 1

Views: 837

Answers (2)

akrun

Reputation: 887991

We get the 'OC_ID' values whose 'Cell' is greater than 30 ('nm1'). Then we use gsub to strip the parts of the 'df2' column names that are not needed ('nm2'), grep 'nm1' against 'nm2' to get the column indices, and extract those columns from 'df2'.

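nm1 <- df$OC_ID[df$Cell > 30]
nm2 <- gsub('.*(OC).*_([A-Z]{2}_\\d+).*', '\\1_\\2', names(df2))
idx <- grep(paste(nm1, collapse = '|'), nm2)  # positions of the matching columns in df2
df2N <- df2[c(1:2, idx)]
names(df2N)[-(1:2)] <- nm2[idx]  # rename only the selected columns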
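
To try this approach end to end, here is a minimal sketch on a cut-down copy of the question's data (three ID columns, two rows). The names `keep_ids`, `short`, `idx` and `res` are illustrative and not from the answer. It also swaps grep for an exact `%in%` comparison, which is an assumption on my part: it avoids one ID matching as a substring of a longer one, at the cost of requiring the cleaned names to equal the OC_ID values exactly.

```r
# Cut-down copies of the question's data (assumed to be plain character/numeric columns).
df <- data.frame(
  OC_ID = c("OC_DD_0181", "OC_AD_9787", "OC_GH_6227"),
  Cell  = c(56, 3, 66),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr     = c("chr1", "chr1"),
  leftPos = c(720916, 736092),
  Tumour2_OCLLL_DD_0181_SLH.9396.fq.gz  = c(13.4031903, 3.4367155),
  Tumour2_OCLLL_DD_09787_SLH.9396.fq.gz = c(28.4522464, 36.7797331),
  Tumour3_OCLLL_GH_6227_SLH.9396.fq.gz  = c(10.34087, 6.893913),
  check.names = FALSE
)

# IDs whose Cell value exceeds 30.
keep_ids <- df$OC_ID[df$Cell > 30]

# Reduce each df2 header to the "OC_XX_1234" form; headers without "OC" are left unchanged.
short <- gsub(".*(OC).*_([A-Z]{2}_\\d+).*", "\\1_\\2", names(df2))

# Exact comparison instead of grep (illustrative substitution).
idx <- which(short %in% keep_ids)

# Keep chr, leftPos and the matching columns, then rename only the matched ones.
res <- df2[c(1:2, idx)]
names(res)[-(1:2)] <- short[idx]

print(res)
```

Renaming with `short[idx]` rather than a fixed `3:4` range keeps the code correct however many columns pass the filter.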

Upvotes: 4

CarlAH

Reputation: 217

Just posting my slow (and at this point irrelevant) answer for completeness. The akrun answer (which I upvoted) seems more elegant though in its utilization of regular expressions.

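namevector <- as.character(df$OC_ID[df$Cell > 30])
# The df2 headers contain "OCLLL_DD_0181", not "OC_DD_0181", so match on the part after "OC_"
grepnames <- paste(sub("^OC_", "", namevector), collapse = "|")
indices <- grep(pattern = grepnames, names(df2))
df3 <- df2[, c(1, 2, indices)]
df3  # the output you requested (columns keep their original df2 names)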

This builds a character vector of the OC_IDs you want from df and collapses it into a single alternation pattern for grep. grep then finds the indices of the matching columns in df2, and df3 is created from the first two columns plus any matching columns.
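
A self-contained sketch of this grep-on-raw-names approach, using a cut-down copy of the question's data. One assumption is labelled in the code: because the df2 headers contain `OCLLL_DD_0181` rather than `OC_DD_0181`, the sketch drops the leading `OC_` from each ID before building the pattern. The variable `keys` is illustrative and not part of the answer.

```r
# Cut-down copies of the question's data.
df <- data.frame(
  OC_ID = c("OC_DD_0181", "OC_AD_9787", "OC_GH_6227"),
  Cell  = c(56, 3, 66),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr     = c("chr1", "chr1"),
  leftPos = c(720916, 736092),
  Tumour2_OCLLL_DD_0181_SLH.9396.fq.gz  = c(13.4031903, 3.4367155),
  Tumour2_OCLLL_DD_09787_SLH.9396.fq.gz = c(28.4522464, 36.7797331),
  Tumour3_OCLLL_GH_6227_SLH.9396.fq.gz  = c(10.34087, 6.893913),
  check.names = FALSE
)

# IDs to keep, as plain character strings.
namevector <- as.character(df$OC_ID[df$Cell > 30])

# Assumption: match on the part after "OC_", since the raw headers read "OCLLL_...".
keys <- sub("^OC_", "", namevector)

# One alternation pattern, e.g. "DD_0181|GH_6227".
grepnames <- paste(keys, collapse = "|")

# Column positions in df2 whose header contains any of the keys.
indices <- grep(grepnames, names(df2))

# First two columns plus the matches; headers stay as they are in df2.
df3 <- df2[, c(1, 2, indices)]
print(df3)
```

Because grep does substring matching, a key such as `DD_0181` would also match a header containing `DD_01812`; anchoring the pattern or comparing cleaned names exactly avoids that if the real IDs can overlap.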

Upvotes: 2
