Reputation: 6874
I have a data frame called df:
OC_ID Bcode Bcode_full Cell Ploidy Goodness_of_fit
OC_DD_0181 LP2000 LP2000-A 56 3 0.45
OC_AD_9787 LP2003 LP2003-B 3 3 0.44
OC_GH_6227 LP2333 LP2333-S 66 3 0.89
I have another data frame called df2:
chr leftPos Tumour2_OCLLL_DD_0181_SLH.9396.fq.gz Tumour2_OCLLL_DD_09787_SLH.9396.fq.gz Tumour3_OCLLL_GH_6227_SLH.9396.fq.gz Tumour4_OCLLL_GH_6632_SLH.9396.fq.gz Tumour5_OCLLL_WH_6992_SLH.9396.fq.gz
chr1 720916 13.4031903 28.4522464 10.34087 23.4309208 16.239874
chr1 736092 3.4367155 36.7797331 6.893913 58.5773021 59.546204
chr1 818159 108.9438802 109.6452421 78.131014 90.2779596 108.265825
chr1 4105086 114.4426249 103.7466057 59.747246 48.9292758 129.91899
chr1 4140849 23.7133367 0.6939572 45.95942 53.0641442 37.893039
The column names in df2 do not correspond exactly to the values in the OC_ID column of df. I want to create a data frame that contains only those columns of df2 whose matching row in df has a Cell value > 30, with the columns renamed to the OC_ID, so that the expected output is:
chr leftPos OC_DD_0181 OC_GH_6227
chr1 720916 13.4031903 10.34087
chr1 736092 3.4367155 6.893913
chr1 818159 108.9438802 78.131014
chr1 4105086 114.4426249 59.747246
chr1 4140849 23.7133367 45.95942
Upvotes: 1
Views: 837
Reputation: 887991
We get the 'OC_ID' values whose 'Cell' is greater than 30 ('nm1'). Then we use gsub to remove the substrings that are not needed from the column names of 'df2' ('nm2'), grep 'nm1' against those cleaned names to get the column indices, and extract those columns from 'df2'.
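# OC_IDs whose Cell value exceeds 30
nm1 <- df$OC_ID[df$Cell > 30]
# Reduce each df2 column name to the 'OC_XX_1234' form
nm2 <- gsub('.*(OC).*_([A-Z]{2}_\\d+).*', '\\1_\\2', names(df2))
idx <- grep(paste(nm1, collapse='|'), nm2)
df2N <- df2[c(1:2, idx)]
names(df2N)[-(1:2)] <- nm2[idx]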
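To see what the gsub step does on its own, here is a minimal sketch that applies the same pattern to the header names from the question. The headers are typed in by hand, and `hdr` and `short` are illustrative names only.

```r
# Column names of df2 as shown in the question
hdr <- c("chr", "leftPos",
         "Tumour2_OCLLL_DD_0181_SLH.9396.fq.gz",
         "Tumour2_OCLLL_DD_09787_SLH.9396.fq.gz",
         "Tumour3_OCLLL_GH_6227_SLH.9396.fq.gz",
         "Tumour4_OCLLL_GH_6632_SLH.9396.fq.gz",
         "Tumour5_OCLLL_WH_6992_SLH.9396.fq.gz")

# Keep 'OC' plus the two-letter/number part of each name.
# Names that do not fit the pattern ('chr', 'leftPos') pass through unchanged.
short <- gsub('.*(OC).*_([A-Z]{2}_\\d+).*', '\\1_\\2', hdr)
print(short)
```

Note that the second sample header reduces to OC_DD_09787, which is not the same string as OC_AD_9787 in df. That does not affect the example, because that row has Cell = 3 and is filtered out anyway.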
Upvotes: 4
Reputation: 217
Just posting my slow (and at this point irrelevant) answer for completeness. akrun's answer (which I upvoted) seems more elegant in its use of regular expressions, though.
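namevector <- as.character(df$OC_ID[df$Cell > 30])
# The df2 headers read 'OCLLL_DD_0181', not 'OC_DD_0181', so match on the part after 'OC_'
keys <- sub("^OC_", "", namevector)
grepnames <- paste(keys, collapse="|")
indices <- grep(pattern=grepnames, names(df2))
df3 <- df2[, c(1, 2, indices)]
df3 # the requested columns, still under their original df2 names
So: a character vector of the wanted OC_IDs is built from df. The 'OC_' prefix is dropped, because the df2 headers contain 'OCLLL_DD_0181' rather than 'OC_DD_0181', and the remaining keys are collapsed into a single grep alternation pattern. grep returns the indices of the matching columns in df2, and df3 is built from the first two columns plus those matches. The selected columns keep their long df2 names.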
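One caveat with grep here is that it matches substrings, so an ID such as OC_GH_6227 would also select a column whose cleaned name is OC_GH_62270. If that could happen in your data, a possible variant is to compare the cleaned names exactly with match(). This is only a sketch: the data are the first two rows and three sample columns typed in from the question, and `keep`, `short`, `pos` and `out` are illustrative names.

```r
df <- data.frame(OC_ID = c("OC_DD_0181", "OC_AD_9787", "OC_GH_6227"),
                 Cell  = c(56, 3, 66),
                 stringsAsFactors = FALSE)
df2 <- data.frame(chr = c("chr1", "chr1"),
                  leftPos = c(720916, 736092),
                  Tumour2_OCLLL_DD_0181_SLH.9396.fq.gz  = c(13.4031903, 3.4367155),
                  Tumour2_OCLLL_DD_09787_SLH.9396.fq.gz = c(28.4522464, 36.7797331),
                  Tumour3_OCLLL_GH_6227_SLH.9396.fq.gz  = c(10.34087, 6.893913),
                  check.names = FALSE)

keep  <- df$OC_ID[df$Cell > 30]                  # IDs that pass the filter
short <- gsub('.*(OC).*_([A-Z]{2}_\\d+).*', '\\1_\\2', names(df2))
pos   <- match(keep, short)                      # exact comparison, NA if no column
pos   <- pos[!is.na(pos)]                        # drop IDs without a column
out   <- df2[c(1:2, pos)]                        # chr, leftPos, then the matches
names(out)[-(1:2)] <- short[pos]                 # rename to the OC_ID form
print(out)
```

With the sample data, the result has the columns chr, leftPos, OC_DD_0181 and OC_GH_6227, in the order the IDs appear in df.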
Upvotes: 2