melolilili

Reputation: 249

How to subset a character vector based on substring matches?

I want to create a variable, ori.same.maf.barcodes, that stores the strings of ori.maf.barcode whose substring before the fourth "-" character matches a string in sub.same.barcodes.

Here is how sub.same.barcodes and ori.maf.barcode were generated. sub.maf.barcode is a truncated copy of ori.maf.barcode$Tumor_Sample_Barcode, and sub.same.barcodes is the intersection of sub.maf.barcode and sub.met.barcode. Now I want to match sub.same.barcodes back to ori.maf.barcode.

ori.maf.barcode <- [email protected]
sub.maf.barcode <- gsub("^([^-]*-[^-]*-[^-]*-[^-]*).*", "\\1", ori.maf.barcode$Tumor_Sample_Barcode) # Keep only the first four dash-separated fields
sub.same.barcodes <- intersect(sub.maf.barcode, sub.met.barcode)

Attempt:

ori.same.maf.barcodes <- ori.maf.barcode %in% sub.same.barcodes

But my code returns "FALSE" instead of a character vector.

dput(ori.maf.barcode[1:20])

structure(list(Tumor_Sample_Barcode = c("TCGA-2K-A9WE-01A-11D-A382-10", 
"TCGA-2Z-A9J1-01A-11D-A382-10", "TCGA-2Z-A9J2-01A-11D-A382-10", 
"TCGA-2Z-A9J3-01A-12D-A382-10", "TCGA-2Z-A9J5-01A-21D-A382-10", 
"TCGA-2Z-A9J6-01A-11D-A382-10", "TCGA-2Z-A9J7-01A-11D-A382-10", 
"TCGA-2Z-A9J8-01A-11D-A42J-10", "TCGA-2Z-A9JD-01A-11D-A42J-10", 
"TCGA-2Z-A9JG-01A-11D-A42J-10", "TCGA-2Z-A9JI-01A-11D-A42J-10", 
"TCGA-2Z-A9JJ-01A-11D-A42J-10", "TCGA-2Z-A9JK-01A-11D-A42J-10", 
"TCGA-2Z-A9JM-01A-12D-A42J-10", "TCGA-2Z-A9JN-01A-21D-A42J-10", 
"TCGA-2Z-A9JO-01A-11D-A42J-10", "TCGA-2Z-A9JQ-01A-11D-A42J-10", 
"TCGA-2Z-A9JR-01A-12D-A42J-10", "TCGA-2Z-A9JS-01A-21D-A42J-10", 
"TCGA-3Z-A93Z-01A-11D-A36X-10")), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x0000025e377005d0>)

dput(sub.met.barcode[1:20])

c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-UZ-A9PZ-01A", 
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-G7-7502-01A", "TCGA-B1-A47M-11A", 
"TCGA-SX-A7SO-01A", "TCGA-HE-A5NJ-01A", "TCGA-MH-A856-01A", "TCGA-A4-8312-01A", 
"TCGA-BQ-5892-01A", "TCGA-A4-7732-11A", "TCGA-5P-A9K9-01A", "TCGA-UZ-A9PX-01A", 
"TCGA-BQ-7061-01A", "TCGA-BQ-5876-01A", "TCGA-DZ-6134-01A", "TCGA-BQ-5884-01A", 
"TCGA-BQ-5889-11A")

Upvotes: 2

Views: 124

Answers (3)

akrun

Reputation: 887193

We could use sub to extract the substring up to the fourth - and then use %in% to create a logical vector for subsetting

i1 <- trimws(sub("^(([^-]+-){4}).*", "\\1", ori.maf.barcode), 
         whitespace = "-") %in%  
       sub("^(([^-]+-){4}).*", "\\1", sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[i1]

-output

> ori.same.maf.barcodes
[1] "TCGA-BQ-7058-01A-11D-1963-05" 
[2] "TCGA-2Z-A9JQ-01A-11D-A42K-05" 
[3] "TCGA-BQ-5887-11A-01D-1963-05"

update

Using the new dput in the OP's post, 'ori.maf.barcode' is a data.table with a column named 'Tumor_Sample_Barcode'. Extract the column with $ or [[ in base R, or use the data.table methods directly to subset

library(data.table)
ori.maf.barcode[trimws(sub("^(([^-]+-){4}).*", "\\1", 
   Tumor_Sample_Barcode), 
          whitespace = "-") %in% sub("^(([^-]+-){4}).*", "\\1", sub.met.barcode)]
           Tumor_Sample_Barcode
                         <char>
1: TCGA-2Z-A9JQ-01A-11D-A42J-10
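
For the base R route mentioned above (extracting the column with [[), a minimal sketch might look like the following. The sample values are a shortened copy of the question's dput output, a plain data.frame stands in for the data.table, and the names full and prefix are only illustrative.

```r
# Assumed sample data, shortened from the question's dput output
ori.maf.barcode <- data.frame(
  Tumor_Sample_Barcode = c("TCGA-2Z-A9JO-01A-11D-A42J-10",
                           "TCGA-2Z-A9JQ-01A-11D-A42J-10",
                           "TCGA-2Z-A9JR-01A-12D-A42J-10"),
  stringsAsFactors = FALSE
)
sub.met.barcode <- c("TCGA-BQ-7058-01A", "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A")

# Pull the column out as a plain character vector
full <- ori.maf.barcode[["Tumor_Sample_Barcode"]]

# Keep the first four dash-separated fields of each barcode
prefix <- sub("^(([^-]+-){3}[^-]+).*", "\\1", full)

# Subset the full barcodes whose prefix appears in the reference vector
ori.same.maf.barcodes <- full[prefix %in% sub.met.barcode]
ori.same.maf.barcodes
```

Because the pattern captures three "field-" groups plus a fourth field without its trailing dash, no trimws step is needed here.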

data

ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05",
  "TCGA-DZ-6131-01A-11D-1963-05", 
"TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05", 
"TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06"
)

 sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", 
"TCGA-UZ-A9PZ-03A", 
"TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

Upvotes: 2

br00t

Reputation: 1614

Please note that with the sample data you have provided it is not possible for the value TCGA-G7-7502-01A-12D-A43K-06 to appear in the output.

library(stringr)

sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A", 
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")

# str_extract returns a character vector (one prefix per barcode), which %in% can compare directly
idx <- which(str_extract(ori.maf.barcode, '^.{4}-.{2}-.{4}-.{3}') %in% sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[ idx ]
print(ori.same.maf.barcodes)

Output:

[1] "TCGA-BQ-7058-01A-11D-1963-05" "TCGA-2Z-A9JQ-01A-11D-A42K-05" "TCGA-BQ-5887-11A-01D-1963-05"

Upvotes: 1

sconfluentus

Reputation: 4993

You're almost there, but your code ori.maf.barcode %in% sub.same.barcodes creates a logical vector of TRUE and FALSE values, which is what you are seeing. To get back the values for which it is TRUE, you need to pass that expression into a subsetting operation.

ori.maf.barcode[which(ori.maf.barcode %in% sub.same.barcodes)]

If ori.maf.barcode is a vector, this should return another vector containing only the entries for which the logical test is TRUE.
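
As a small illustration of the difference between the logical result and the subset, here is a sketch with made-up vectors x and keep (both names are hypothetical, and the values only mimic the barcode format):

```r
# Hypothetical short vectors in the same format as the truncated barcodes
x    <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-01A", "TCGA-2Z-A9JQ-01A")
keep <- c("TCGA-2Z-A9JQ-01A", "TCGA-BQ-7058-01A")

# %in% returns one TRUE/FALSE per element of x
hits <- x %in% keep
hits

# Indexing x with that logical vector returns the matching strings
x[hits]
```

Note that which() is optional here, since a logical vector can be used as an index directly.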

You also need to match on the start of each string to pick entries by their first part, as iod said below:

This loop picks them out one at a time and adds them to a new vector

new.barcodes <- c()
for (prefix in sub.same.barcodes) {
  # keep the full barcodes that start with the current prefix
  hits <- ori.maf.barcode[startsWith(ori.maf.barcode, prefix)]
  new.barcodes <- c(new.barcodes, hits)
}

This will iterate through your prefixes and pull out what you want into a new vector
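
If growing a vector inside a loop is a concern, one possible loop-free variant is sketched below. It assumes the sample vectors used in the other answers; matches and keep are illustrative names.

```r
# Assumed sample data, as used in the other answers
ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

# One logical vector per distinct prefix: does each barcode start with it?
matches <- lapply(unique(sub.same.barcodes),
                  function(p) startsWith(ori.maf.barcode, p))

# Combine with element-wise OR: TRUE where any prefix matched
keep <- Reduce(`|`, matches)

new.barcodes <- ori.maf.barcode[keep]
new.barcodes
```

Unlike the loop, this keeps the original order of ori.maf.barcode and returns each barcode at most once, even when a prefix is repeated in sub.same.barcodes. Like the loop, startsWith does not check that the prefix ends at a "-" boundary.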

Upvotes: 0
