Milan91

Reputation: 21

Regex in R, matching strings

I have strings like "X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2" and "X96HE6.10nMBI_3_2", and I would like to match only the numbers 1, 2 and 3 that sit between the underscores, without the underscores themselves. The best solution I could come up with is str_match(sample_names, "_+[1-3]?"). I would really appreciate the help.

Upvotes: 0

Views: 141

Answers (4)

Chris Ruehlemann

Reputation: 21400

The simplest method is to use sub with a backreference:

Data:

d <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")

Solution:

sub(".*_(\\d)_.*", "\\1", d)

Here, (\\d) defines the capturing group for a single digit (if the number in question can have more than one digit, use \\d+), which is 'recalled' by the backreference \\1 in sub's replacement argument.
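To illustrate the \\d+ variant, here is a minimal sketch. The sample names in d_multi are made up for the example and are not from the question; it also shows that sub leaves a string untouched when the pattern does not match, which is worth checking for in real data.

```r
# Hypothetical sample names (not from the question): the middle field
# has one or more digits
d_multi <- c("X96HE6.10nMBI_12_2", "X96HE6.10nMBI_7_2")

# \\d+ captures the whole run of digits between the two underscores
middle <- sub(".*_(\\d+)_.*", "\\1", d_multi)
print(middle)

# sub() returns the input unchanged when the pattern does not match
unmatched <- sub(".*_(\\d+)_.*", "\\1", "no_match_here")
print(unmatched)
```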

Alternatively, use str_extract with positive lookarounds:

library(stringr)
str_extract(d, "(?<=_)\\d(?=_)")

(?<=_) is positive lookbehind which can be glossed as "If you see _ on the left..."

\\d is the number to be matched

(?=_) is positive lookahead, which can be glossed as "If you see _ on the right..."

Result:

[1] "1" "2" "3"
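To see what the lookarounds contribute, it may help to compare against the same pattern without them. The sketch below also shows a capturing-group alternative with str_match, which is closer to the attempt in the question; the variable names are only for illustration.

```r
library(stringr)

d <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")

# Without lookarounds the underscores are consumed and end up in the match
with_underscores <- str_extract(d, "_\\d_")
print(with_underscores)

# With a capturing group, str_match() returns a matrix: column 1 holds the
# full match and column 2 holds the first group
captured <- str_match(d, "_(\\d)_")[, 2]
print(captured)
```

The lookarounds only assert that an underscore is present on each side; they do not become part of the extracted text.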

Upvotes: 2

Jan

Reputation: 43169

No need for any third-party package:

strings <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
pattern <- "(?<=_)(\\d+)(?=_)"

unlist(regmatches(strings, gregexpr(pattern, strings, perl = TRUE)))

Which yields:

[1] "1" "2" "3"
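One caveat with the unlist step: gregexpr returns every match in each string, so the flattened result can be shorter or longer than the input when a string has no match or several. Below is a minimal sketch of keeping exactly one value per input; the second and third strings are hypothetical and were added only to show those two cases.

```r
# First string is from the question; the other two are made up to show
# a string with no match and a string with several matches
strings <- c("X96HE6.10nMBI_1_2", "no_underscored_number", "A_4_5_6")
pattern <- "(?<=_)\\d+(?=_)"

# One list element per input string, holding all of its matches
all_matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
n_matches <- lengths(all_matches)
print(n_matches)

# Keep the first match per string, or NA when there is none
first_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[1] else NA_character_,
  character(1)
)
print(first_match)
```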

Upvotes: 1

G. Grothendieck

Reputation: 269441

Using x in the Note at the end, read it in using read.table and pick off the second field. No packages or regular expressions are used.

read.table(text = x, sep = "_")[[2]]
## [1] 1 2 3

Note

x <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")
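Note that read.table converts the field to an integer vector. If character output is preferred, a possible variant is sketched below with strsplit, which also avoids regular expressions when fixed = TRUE is set; the names fields and second_field are only for illustration.

```r
x <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")

# read.table() type-converts the column, so the result is integer
rt_class <- class(read.table(text = x, sep = "_")[[2]])
print(rt_class)

# fixed = TRUE splits on a literal underscore, no regex involved
fields <- strsplit(x, "_", fixed = TRUE)

# Pick the second piece of every split string, keeping it as character
second_field <- vapply(fields, `[`, character(1), 2)
print(second_field)
```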

Upvotes: 1

Bruno

Reputation: 4151

You can use lookarounds. I personally rely heavily on the stringr cheatsheet for this kind of regex, since the syntax is a bit hard to remember. On the RStudio cheatsheets page, look for stringr -> LOOK AROUNDS.

library(tidyverse)

codes <- c("X96HE6.10nMBI_1_2", "X96HE6.10nMBI_2_2", "X96HE6.10nMBI_3_2")

codes %>%
  str_extract("(?<=_)[:digit:]+(?=_)")
#> [1] "1" "2" "3"

Created on 2020-06-14 by the reprex package (v0.3.0)

Upvotes: 1
