Reputation: 249
I want to create an ori.same.maf.barcodes variable that stores the strings of ori.maf.barcode whose substrings before the fourth "-" character match the strings in sub.same.barcodes.
How sub.same.barcodes and ori.maf.barcode were generated: sub.maf.barcode is a subset of ori.maf.barcode$Tumor_Sample_Barcode. sub.same.barcodes is the intersection of sub.maf.barcode and sub.met.barcode. Now I want to match sub.same.barcodes back to ori.maf.barcode.
ori.maf.barcode <- [email protected]
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).*", "\\1", ori.maf.barcode$Tumor_Sample_Barcode) # Keep only the first four dash-separated fields
sub.same.barcodes <- intersect(sub.maf.barcode, sub.met.barcode)
Attempt:
ori.same.maf.barcodes <- ori.maf.barcode %in% sub.same.barcodes
But my code returns "FALSE" instead of a character vector.
dput(ori.maf.barcode[1:20])
structure(list(Tumor_Sample_Barcode = c("TCGA-2K-A9WE-01A-11D-A382-10",
"TCGA-2Z-A9J1-01A-11D-A382-10", "TCGA-2Z-A9J2-01A-11D-A382-10",
"TCGA-2Z-A9J3-01A-12D-A382-10", "TCGA-2Z-A9J5-01A-21D-A382-10",
"TCGA-2Z-A9J6-01A-11D-A382-10", "TCGA-2Z-A9J7-01A-11D-A382-10",
"TCGA-2Z-A9J8-01A-11D-A42J-10", "TCGA-2Z-A9JD-01A-11D-A42J-10",
"TCGA-2Z-A9JG-01A-11D-A42J-10", "TCGA-2Z-A9JI-01A-11D-A42J-10",
"TCGA-2Z-A9JJ-01A-11D-A42J-10", "TCGA-2Z-A9JK-01A-11D-A42J-10",
"TCGA-2Z-A9JM-01A-12D-A42J-10", "TCGA-2Z-A9JN-01A-21D-A42J-10",
"TCGA-2Z-A9JO-01A-11D-A42J-10", "TCGA-2Z-A9JQ-01A-11D-A42J-10",
"TCGA-2Z-A9JR-01A-12D-A42J-10", "TCGA-2Z-A9JS-01A-21D-A42J-10",
"TCGA-3Z-A93Z-01A-11D-A36X-10")), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x0000025e377005d0>)
dput(sub.met.barcode[1:20])
c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-UZ-A9PZ-01A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-G7-7502-01A", "TCGA-B1-A47M-11A",
"TCGA-SX-A7SO-01A", "TCGA-HE-A5NJ-01A", "TCGA-MH-A856-01A", "TCGA-A4-8312-01A",
"TCGA-BQ-5892-01A", "TCGA-A4-7732-11A", "TCGA-5P-A9K9-01A", "TCGA-UZ-A9PX-01A",
"TCGA-BQ-7061-01A", "TCGA-BQ-5876-01A", "TCGA-DZ-6134-01A", "TCGA-BQ-5884-01A",
"TCGA-BQ-5889-11A")
Upvotes: 2
Views: 124
Reputation: 887193
We could use sub to extract the substring up to the fourth - and then use %in% to build a logical vector for subsetting:
i1 <- trimws(sub("^(([^-]+-){4}).*", "\\1", ori.maf.barcode),
whitespace = "-") %in%
sub("^(([^-]+-){4}).*", "\\1", sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[i1]
-output
> ori.same.maf.barcodes
[1] "TCGA-BQ-7058-01A-11D-1963-05"
[2] "TCGA-2Z-A9JQ-01A-11D-A42K-05"
[3] "TCGA-BQ-5887-11A-01D-1963-05"
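To see what each step of that expression does, here is a minimal sketch on a single barcode taken from the sample data. The intermediate names (x, with_dash, prefix, prefix_direct) are only for illustration, and the second pattern is an alternative I am suggesting, not part of the answer above.

```r
x <- "TCGA-BQ-7058-01A-11D-1963-05"

# Step 1: the capture group keeps four "field-" repeats, so a dash is left at the end
with_dash <- sub("^(([^-]+-){4}).*", "\\1", x)

# Step 2: trimws() with whitespace = "-" strips leading and trailing dashes
prefix <- trimws(with_dash, whitespace = "-")

# Alternative sketch: capture three "field-" repeats plus one bare field,
# which avoids the trailing dash and so needs no trimws()
prefix_direct <- sub("^(([^-]+-){3}[^-]+).*", "\\1", x)

print(with_dash)
print(prefix)
print(prefix_direct)
```

Note that trimws() only accepts the whitespace argument in R 3.6.0 or later.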
Using the new dput in the OP's post, 'ori.maf.barcode' is a data.table with a column named 'Tumor_Sample_Barcode'. Extract the column with $ or [[ in base R, or directly use the data.table methods to subset:
library(data.table)
ori.maf.barcode[trimws(sub("^(([^-]+-){4}).*", "\\1",
Tumor_Sample_Barcode),
whitespace = "-") %in% sub("^(([^-]+-){4}).*", "\\1", sub.met.barcode)]
Tumor_Sample_Barcode
<char>
1: TCGA-2Z-A9JQ-01A-11D-A42J-10
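For the base R route, a rough sketch follows. It assumes a plain data.frame holding three barcodes picked from the OP's dput, and a shortened sub.met.barcode; the names whole_table, barcodes and matched are hypothetical. It also shows why calling %in% on the table itself gives a single value: a data.frame is a list of columns, so there is one result per column rather than one per row.

```r
# Small stand-in for the OP's table (three rows taken from the dput output)
ori.maf.barcode <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-2Z-A9JO-01A-11D-A42J-10",
                           "TCGA-2Z-A9JQ-01A-11D-A42J-10",
                           "TCGA-2Z-A9JR-01A-12D-A42J-10"),
  stringsAsFactors = FALSE
)
sub.met.barcode <- c("TCGA-BQ-7058-01A", "TCGA-2Z-A9JQ-01A")

# %in% on the whole table tests each column as a unit: one value, not one per row
whole_table <- ori.maf.barcode %in% sub.met.barcode

# Extract the column first, then compare the four-field prefixes
barcodes <- ori.maf.barcode[["Tumor_Sample_Barcode"]]
prefix <- trimws(sub("^(([^-]+-){4}).*", "\\1", barcodes), whitespace = "-")
matched <- barcodes[prefix %in% sub.met.barcode]

print(whole_table)
print(matched)
```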
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05",
"TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06"
)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A",
"TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
Upvotes: 2
Reputation: 1614
Please note that with the sample data you have provided it is not possible for the value TCGA-G7-7502-01A-12D-A43K-06 to appear in the output.
library(stringr)
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
idx <- which(str_extract(ori.maf.barcode, '.{4}-.{2}-.{4}-.{3}') %in% sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[ idx ]
print(ori.same.maf.barcodes)
Output:
[1] "TCGA-BQ-7058-01A-11D-1963-05" "TCGA-2Z-A9JQ-01A-11D-A42K-05" "TCGA-BQ-5887-11A-01D-1963-05"
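The pattern above relies on every barcode having fields of width 4, 2, 4 and 3. Under that same assumption, the prefix is simply the first 16 characters (4 + 2 + 4 + 3 plus three dashes), so a base R sketch with substr gives the same selection without stringr. The variable names prefix and same_by_substr are illustrative.

```r
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")

# Assumes fixed-width fields: characters 1 to 16 cover the first four fields
prefix <- substr(ori.maf.barcode, 1, 16)
same_by_substr <- ori.maf.barcode[prefix %in% sub.same.barcodes]
print(same_by_substr)
```

If the field widths can vary, a dash-based regex is the safer choice.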
Upvotes: 1
Reputation: 4993
You're almost there, but your code ori.maf.barcode %in% sub.same.barcodes evaluates to a logical vector of TRUE and FALSE values, which is what you are seeing. To get back the values for which it is TRUE, you need to pass that expression to a subsetting operation.
ori.maf.barcode[which(ori.maf.barcode %in% sub.same.barcodes)]
If it is a vector, this should return another vector containing only those entries for which the logical expression is TRUE.
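As a small illustration of that difference, here is a sketch with made-up toy vectors (full, wanted and keep are hypothetical names, and the strings are already trimmed to four fields so that exact matching works):

```r
full <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-2Z-A9JQ-01A")
wanted <- c("TCGA-2Z-A9JQ-01A", "TCGA-BQ-7058-01A")

# %in% alone gives one TRUE/FALSE per element of the left-hand vector
keep <- full %in% wanted
print(keep)

# Using that logical vector as an index returns the matching strings;
# which() is optional here because logical indexing already works
selected <- full[keep]
print(selected)
```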
You also need to match on the leading part of each string to get the entries, as iod said below.
This loop picks them out one prefix at a time and adds them to a new vector:
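new.barcodes <- c()
# 'prefix' is used as the loop variable to avoid masking base::sub()
for (prefix in sub.same.barcodes) {
  # keep every full barcode that begins with the current prefix
  hits <- ori.maf.barcode[which(startsWith(ori.maf.barcode, prefix))]
  new.barcodes <- c(new.barcodes, hits)
}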
This will iterate through your prefixes and pull out what you want into a new vector
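A vectorised variant of the same prefix test is sketched below, using the sample vectors posted elsewhere in this thread. It is an alternative I am suggesting, not the loop above, and the name matches_any_prefix is hypothetical. One difference worth knowing: the loop returns a barcode once per matching prefix, so a repeated prefix such as "TCGA-2Z-A9JQ-01A" yields a duplicate, whereas this version keeps each barcode at most once and in its original order.

```r
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

# For each full barcode, ask whether it starts with any of the prefixes
matches_any_prefix <- vapply(
  ori.maf.barcode,
  function(barcode) any(startsWith(barcode, sub.same.barcodes)),
  logical(1),
  USE.NAMES = FALSE
)

ori.same.maf.barcodes <- ori.maf.barcode[matches_any_prefix]
print(ori.same.maf.barcodes)
```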
Upvotes: 0