Reputation: 435
I have a file with hundreds of lines like:
>hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f
For only those lines that have the ">" symbol I want to extract the ">" and the "NM_...". In the example above I want to get:
>NM_006820
I'm not sure if this is important, but some lines may share the same output (">NM_006820"), so it can appear more than once.
I tried with gsub() but got completely lost.
Upvotes: 0
Views: 42
Reputation: 887571
We can use sub to match the > at the start (^) of the string and capture it as a group with (...), followed by any characters (.*), followed by _, and then capture two upper-case letters followed by _ and one or more digits (\\d+) as a second group. In the replacement, we use the backreferences to the captured groups (\\1 and \\2):
sub("^(>).*_([A-Z]{2}_\\d+).*", "\\1\\2", str1)
#[1] ">NM_006820" ">NM_006820"
#[3] "Some text without > at the beginning"
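Since the question asks for only the lines that contain ">", one option is to filter with grepl() first and then apply the same sub() call. This is a minimal sketch: the vector `lines` and the names `headers`, `ids` and `unique_ids` are made up for illustration, and the unique() step assumes that repeated IDs are not wanted.

```r
# Illustrative input: two header lines sharing an ID, plus a non-header line
lines <- c(">hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f",
           "ACGTACGT",
           ">hg38_ct_tbrefGene_6787_NM_006820_utr5_1_0_chr1_78641810_f")

# Keep only the lines that start with ">"
headers <- lines[grepl("^>", lines)]

# Same pattern as above: keep ">" and the two-letter accession
ids <- sub("^(>).*_([A-Z]{2}_\\d+).*", "\\1\\2", headers)

# Drop repeated IDs (assumption: each ID should appear only once)
unique_ids <- unique(ids)
```

If the duplicates carry meaning (for example, one entry per transcript region), skip the unique() step and keep `ids` as is.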
If we need all the instances of NM_\\d+ in each string, then
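library(stringr)
# Extract the leading ">" and every two-letter accession (e.g. NM_006820)
v1 <- sapply(str_extract_all(str1, "^>|[A-Z]{2}_\\d+"), function(x)
  toString(paste0(x[1], x[-1])))
# Strings without a leading ">" are restored to their original value
i1 <- !grepl("^>", v1)
v1[i1] <- str1[i1]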
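The same result can probably be reached in base R without stringr, using regmatches() and gregexpr(). The sketch below is an alternative and not part of the original approach; `hits`, `is_header` and `out` are illustrative names, and a header line with no accession would be reduced to a bare ">".

```r
str1 <- c(">hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f",
          ">hg38_ct_tbrefGene_6787_NM_006820_utr3_NM_006820_8_0_chr1_78641810_f",
          "Some text without > at the beginning")

# Every two-letter accession (e.g. NM_006820) found in each string
hits <- regmatches(str1, gregexpr("[A-Z]{2}_\\d+", str1))

# Only strings starting with ">" are rewritten; the rest stay unchanged
is_header <- grepl("^>", str1)
out <- str1
out[is_header] <- vapply(hits[is_header],
                         function(x) toString(paste0(">", x)),
                         character(1))
```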
# Example data used above
str1 <- c(">hg38_ct_tbrefGene_6787_NM_006820_utr3_8_0_chr1_78641810_f",
          ">hg38_ct_tbrefGene_6787_NM_006820_utr3_NM_006820_8_0_chr1_78641810_f",
          "Some text without > at the beginning")
Upvotes: 2