Achal Neupane

Reputation: 5719

Remove letters matching pattern before and after the required string

I have a vector with the following elements:

myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

I want to selectively extract the value after chr and before .recalibrated and get the result.

Result:

10, 11, Y

Upvotes: 0

Views: 547

Answers (4)

Tensibai

Reputation: 15784

Both other regex answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here is my own approach, which differs only in the regex and uses sub:

sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)

What the regex does:

  • .*[.]chr matches as much as possible up to a literal '.chr'
  • ([^.]*) captures everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, which requires at least one digit to be present)
  • [.].* matches the rest of the line after a literal dot
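To make the difference concrete, here is a minimal sketch comparing this pattern with a lazy pattern of the kind used in another answer. The vector myvec2 is a made-up input for illustration, and perl = TRUE on the lazy pattern is my own choice so the lazy quantifiers follow PCRE semantics:

```r
# Hypothetical input that includes the "slightly different" case
myvec2 <- c("output.chr10.recalibrated",
            "whatever.chr10.whateverelse.recalibrated",
            "output.chrY.recalibrated")

# Negated character class: the capture stops at the first dot after chr
by_class <- sub(".*[.]chr([^.]*)[.].*", "\\1", myvec2)
print(by_class)
# [1] "10" "10" "Y"

# Lazy pattern: the capture runs up to ".recalibrated", so extra segments leak in
by_lazy <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec2, perl = TRUE)
print(by_lazy)
# [1] "10"              "10.whateverelse" "Y"

# \\d+ variant: non-numeric names no longer match, and sub returns them unchanged
by_digits <- sub(".*[.]chr(\\d+)[.].*", "\\1", myvec2)
print(by_digits)
# [1] "10"                       "10"                       "output.chrY.recalibrated"
```

Note the last result: with \\d+ the Y element is returned as-is, so that variant only makes sense if every chromosome name is numeric.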

I prefer the character class escape for dots ([.]) over the backslash escape (\\.) because it's usually easier to read when you come back to the regex. That's only my opinion and not covered by any best practice I know of.
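As a quick check that the two escaping styles are interchangeable here, a small sketch using the vector from the question:

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# Dots escaped with a character class
with_class <- sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)

# Same pattern with backslash-escaped dots
with_backslash <- sub(".*\\.chr([^.]*)\\..*", "\\1", myvec)

print(identical(with_class, with_backslash))
# [1] TRUE
```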

Upvotes: 3

zx8754

Reputation: 56054

This looks like an XY problem. Why extract at all? If the value is needed in further analysis steps, we could, for example, build the names from the chromosome labels instead:

for(chrN in c(1:22, "X", "Y")) {
  myVar <- paste0("output.chr", chrN, ".recalibrated")
  #do some fun stuff with myVar 
  print(myVar)
}
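If both the label and the full name are needed later, one possible variation is to keep them together in a named vector, so nothing has to be parsed back out. This is only a sketch; the names chrs and files are hypothetical:

```r
# Chromosome labels; mixing numbers and letters yields a character vector
chrs <- c(1:22, "X", "Y")

# Full names, keyed by their chromosome label
files <- setNames(paste0("output.chr", chrs, ".recalibrated"), chrs)

print(files[["Y"]])
# [1] "output.chrY.recalibrated"

print(names(files)[10])
# [1] "10"
```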

Upvotes: 0

akrun

Reputation: 887028

We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and precede .recalibrated ((?=\\.recalibrated)).

 library(stringr)
 str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
 #[1] "10" "11" "Y" 

Or use gsub to match either everything up to and including .chr, or (|) everything from .recalibrated to the end ($) of the string, and replace each match with ''.

 gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
 #[1] "10" "11" "Y" 
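If loading stringr is not an option, the same lookaround pattern should also work in base R. A minimal sketch, assuming perl = TRUE (the lookarounds need the PCRE engine) and using regmatches, which silently drops elements that do not match:

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# Locate the text between "chr" and ".recalibrated" without consuming either
m <- regexpr("(?<=chr).*(?=\\.recalibrated)", myvec, perl = TRUE)

# Pull out the matched substrings
extracted <- regmatches(myvec, m)
print(extracted)
# [1] "10" "11" "Y"
```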

Upvotes: 2

Wiktor Stribiżew

Reputation: 626738

You can do that with a mere sub:

> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y" 

The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
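One thing to keep in mind with sub: when the pattern does not match, the input element is returned unchanged rather than as NA. A minimal sketch of a guard, assuming a made-up non-matching element and perl = TRUE (my own choice, so the lazy quantifiers follow PCRE semantics):

```r
# Hypothetical input: the second element does not follow the naming scheme
x <- c("output.chr10.recalibrated", "no_match_here")
pat <- ".*?chr(.*?)\\.recalibrated.*"

res <- sub(pat, "\\1", x, perl = TRUE)
print(res)
# [1] "10"            "no_match_here"

# Replace the untouched non-matches with NA so they are easy to spot
res[!grepl(pat, x, perl = TRUE)] <- NA
print(res)
# [1] "10" NA
```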


As an alternative, use str_match:

> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y" 

It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.

The pattern means:

  • chr - match the literal character sequence chr
  • (.*?) - match and capture as few characters as possible, excluding newlines (if you need to match newlines too, add (?s) at the beginning of the pattern), up to the first occurrence of what follows
  • \\.recalibrated - the literal character sequence .recalibrated.
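For comparison, a base R sketch of the same capture-group idea, using regexec and regmatches in place of str_match. The perl = TRUE argument and the NA fallback for non-matching elements are my own additions, not part of the answer above:

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# Each list element holds the full match followed by the capture groups
m <- regmatches(myvec, regexec("chr(.*?)\\.recalibrated", myvec, perl = TRUE))

# Take the first capture group, or NA when there was no match
captured <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                   character(1))
print(captured)
# [1] "10" "11" "Y"
```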

Upvotes: 7
