Reputation: 5719
I have a vector with the following elements:
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")
I want to extract the value that comes after chr and before .recalibrated, and get the following result.
Result:
10, 11, Y
Upvotes: 0
Views: 547
Reputation: 15784
Both answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated.
Here's my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
What the regex does:
.*[.]chr
- match as much as possible up to a literal '.chr'
([^.]*)
- capture everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, requiring at least one digit)
[.].*
- match the rest of the line after a literal dot
I prefer escaping dots with a character class ([.]) over the backslash escape (\\.), as it's usually easier to read when you come back to the regex. That's my opinion and not covered by any best practice I know of.
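As a quick check of the explanation above, here is a minimal sketch that runs the pattern on two of the original names plus one longer name. The extra input and the variable names `inputs`, `any_label` and `digits_only` are illustrative assumptions, not part of the question.

```r
# Two names from the question plus one hypothetical longer file name
inputs <- c("output.chr10.recalibrated",
            "output.chrY.recalibrated",
            "whatever.chr10.whateverelse.recalibrated")

# Capture whatever sits between ".chr" and the next dot
any_label <- sub(".*[.]chr([^.]*)[.].*", "\\1", inputs)
print(any_label)

# Variant with \\d+: only numeric labels match; sub() returns
# non-matching elements (here the chrY name) unchanged
digits_only <- sub(".*[.]chr(\\d+)[.].*", "\\1", inputs)
print(digits_only)
```

Because sub() leaves non-matching strings untouched, the \\d+ variant does not drop chrY; it passes the whole name through, which may or may not be what you want.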
Upvotes: 3
Reputation: 56054
This looks like an XY problem. Why extract at all? If the value is needed in further analysis steps, we could, for example, do this instead:
for (chrN in c(1:22, "X", "Y")) {
  myVar <- paste0("output.chr", chrN, ".recalibrated")
  # do some fun stuff with myVar
  print(myVar)
}
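If the loop body only needs the names themselves, the same set can probably be built in one vectorised call, since paste0 recycles the fixed prefix and suffix over the chromosome labels. A minimal sketch; the names `chroms` and `all_names` are made up for illustration.

```r
# 1:22 is coerced to character when combined with "X" and "Y"
chroms <- c(1:22, "X", "Y")

# One name per chromosome label, no explicit loop
all_names <- paste0("output.chr", chroms, ".recalibrated")

print(head(all_names, 3))
print(length(all_names))
```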
Upvotes: 0
Reputation: 887028
We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and precede .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match either the characters up to and including .chr, or (|) everything from .recalibrated to the end ($) of the string, and replace each match with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
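For readers without stringr, the same lookaround pattern can be tried in base R. This is a sketch that assumes perl = TRUE, which base R needs for lookarounds; the names `m` and `labels` are illustrative.

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# regexpr() locates the first match in each string; PCRE handles the lookarounds
m <- regexpr("(?<=chr).*(?=\\.recalibrated)", myvec, perl = TRUE)

# regmatches() pulls out the matched substrings
labels <- regmatches(myvec, m)
print(labels)
```

One caveat: regmatches() drops elements that do not match, so the result can be shorter than the input, whereas str_extract returns NA in those positions.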
Upvotes: 2
Reputation: 626738
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any characters before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 (written "\\1" in an R string) that inserts the captured value back into the resulting string.
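One thing worth keeping in mind with this approach: when the pattern does not match, sub() returns the input string unchanged rather than NA. The sketch below guards against that with grepl(). The second input, the variable names, and the use of perl = TRUE (added here so the lazy quantifiers are handled by PCRE) are assumptions for illustration.

```r
pat <- ".*?chr(.*?)\\.recalibrated.*"

# One name from the question and one hypothetical non-matching string
x <- c("output.chr10.recalibrated", "something.else")

# Unguarded: the non-matching element passes through as-is
raw <- sub(pat, "\\1", x, perl = TRUE)
print(raw)

# Guarded: keep the capture only where the pattern actually matched
guarded <- ifelse(grepl(pat, x, perl = TRUE), raw, NA_character_)
print(guarded)
```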
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
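A rough base-R analogue of the same capture-group idea, for cases where stringr is not available, is regexec() combined with regmatches(). This is only a sketch; the names `m` and `captured` are illustrative, and perl = TRUE is an assumption made so the lazy quantifier is handled by PCRE.

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# regexec() records the full match and each capture group per string
m <- regmatches(myvec, regexec("chr(.*?)\\.recalibrated", myvec, perl = TRUE))

# Each list element holds the full match first, then capture group 1
captured <- vapply(m, function(parts) parts[2], character(1))
print(captured)
```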
The pattern means:
chr
- match a sequence of literal characters chr
(.*?)
- match and capture any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated
- .recalibrated literal character sequence.
Upvotes: 7