Tato14

Reputation: 435

Keep only the first and the centre part of a string between "_"

I have a file with hundreds of lines like:

>hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f

For only those lines that start with the ">" symbol, I want to keep the ">" and the "NM_..." identifier. For the example above, the desired output is:

>NM_006820

I'm not sure whether this matters, but several lines may produce the same output (">NM_006820").

I tried gsub() but got completely lost.

Upvotes: 0

Views: 42

Answers (1)

akrun

Reputation: 887571

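We can use sub to match the > at the start (^) of the string and capture it as a group ((...)), followed by any characters (.*), then a _, and then capture two upper-case letters followed by _ and one or more digits (\\d+). In the replacement, we use backreferences to the captured groups (\\1\\2). Because .* is greedy, the last identifier in the string is the one captured when it occurs more than once, and strings that do not match the pattern are returned unchanged.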

sub("^(>).*_([A-Z]{2}_\\d+).*", "\\1\\2", str1)
#[1] ">NM_006820"                           ">NM_006820"
#[3] "Some text without > at the beginning"
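
The question asks for only the lines that start with ">" and mentions that the same identifier may come up more than once. A minimal sketch of one way to handle both, assuming the example vector from the data section (repeated here so the snippet runs on its own); the names is_header, ids and unique_ids are illustrative and not part of the original answer, and dropping duplicates is an assumption about what is wanted:

```r
# Example input, copied from the data section so the snippet is self-contained
str1 <- c(">hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f",
          ">hg38_ct_tbrefGene_6787_NM_006820_utr3_NM_006820_8_0_chr1_78641810_f",
          "Some text without > at the beginning")

# Keep only the lines that start with ">"
is_header <- grepl("^>", str1)

# Apply the same substitution to those lines only
ids <- sub("^(>).*_([A-Z]{2}_\\d+).*", "\\1\\2", str1[is_header])

# Optionally drop repeated identifiers (assumes duplicates are not wanted)
unique_ids <- unique(ids)
```

Here ids holds one entry per ">" line, and unique_ids keeps each identifier once.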

If we need all instances of the [A-Z]{2}_\\d+ pattern (e.g. NM_006820) in each string, then

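library(stringr)
# Extract the leading ">" and every identifier, then prefix each
# identifier with ">" and join them with ", "
v1 <- sapply(str_extract_all(str1, "^>|[A-Z]{2}_\\d+"), function(x) 
                         toString(paste0(x[1], x[-1]) ))

# Restore the original text for strings that do not start with ">"
i1 <- !grepl("^>", v1)
v1[i1] <- str1[i1]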
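
If loading stringr is not desired, the same idea can be sketched in base R with gregexpr and regmatches. This is only an illustration under the same assumptions as above: the input is the example vector from the data section (repeated here), [0-9]+ stands in for \\d+, and the names hits, is_header and out are hypothetical.

```r
# Example input, copied from the data section so the snippet is self-contained
str1 <- c(">hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f",
          ">hg38_ct_tbrefGene_6787_NM_006820_utr3_NM_006820_8_0_chr1_78641810_f",
          "Some text without > at the beginning")

# Find every identifier (two upper-case letters, "_", digits) in each string
hits <- regmatches(str1, gregexpr("[A-Z]{2}_[0-9]+", str1))

# For lines starting with ">", prefix each hit with ">" and join with ", ";
# all other lines are left as they are
is_header <- startsWith(str1, ">")
out <- str1
out[is_header] <- vapply(hits[is_header],
                         function(h) toString(paste0(">", h)),
                         character(1))
```

On the example data, out should match v1 from the stringr version: one ">NM_006820" for the first string, two for the second, and the third string unchanged.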

data

str1 <- c(">hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f", 
          ">hg38_ct_tbrefGene_6787_NM_006820_utr3_NM_006820_8_0_chr1_78641810_f",
          "Some text without > at the beginning")

Upvotes: 2
