user3387899

Reputation: 611

Extract specific pattern (4 digits, '-', 1digit) from string in R

Imagine the following string:

x <- "aokizoizeon1854-1zeoijzeoinq1234-2zeze"

I want to extract from this string '1854-1' and '1234-2'. So basically every part of the string that consists of 4 digits followed by a '-' and again one digit.

What is the easiest way to do this?

Upvotes: 2

Views: 2080

Answers (2)

mtoto

Reputation: 24188

A possible base R solution is to strip all the lowercase letters and then split what remains into 6-character chunks. This assumes the string contains nothing but lowercase letters and the wanted patterns:

x1 <- gsub("[a-z]", "", x)
substring(x1, seq(1,nchar(x1),6), seq(6, nchar(x1), 6))
#[1] "1854-1" "1234-2"
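If the string may contain other digits, uppercase letters, or punctuation, the fixed-width split above no longer lines up. A minimal sketch of a different base R approach, using regmatches() with gregexpr() to pull out the pattern directly (the variable name res is only illustrative):

```r
x <- "aokizoizeon1854-1zeoijzeoinq1234-2zeze"

# gregexpr() locates every match of "4 digits, hyphen, 1 digit";
# regmatches() returns the matched substrings as a list (one element per input string)
res <- regmatches(x, gregexpr("[0-9]{4}-[0-9]", x))[[1]]
print(res)
#[1] "1854-1" "1234-2"
```

Note that this pattern has no boundary checks, so it would also match inside a longer run of digits.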

Upvotes: 3

Wiktor Stribiżew

Reputation: 626870

You can use str_extract_all from the stringr package. Its regex engine (ICU) supports lookarounds.

If your strings can contain digit sequences longer than 4, and you want to match only sequences of exactly 4 digits followed by a hyphen and then exactly one digit, you will need lookarounds to make the matches precise:

> library(stringr)
> x <- "aokizoizeon1854-1zeoijzeoinq1234-2zeze12445664-345ff"
> unlist(str_extract_all(x, "(?<!\\d)\\d{4}-\\d(?!\\d)"))
[1] "1854-1" "1234-2"

The (?<!\\d)\\d{4}-\\d(?!\\d) regex matches:

  • (?<!\\d) - fail the match if there is a digit immediately before the current position
  • \\d{4}-\\d - match 4 digits followed by a hyphen and 1 digit
  • (?!\\d) - but fail the match if that one digit is followed by another digit.
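To see what the lookarounds change, here is a small sketch that runs the pattern with and without them. It uses base R's gregexpr() with perl = TRUE rather than stringr, purely so that it runs without extra packages; the names plain and strict are illustrative.

```r
x <- "aokizoizeon1854-1zeoijzeoinq1234-2zeze12445664-345ff"

# Without lookarounds: also picks up "5664-3" from inside "12445664-345"
plain <- regmatches(x, gregexpr("\\d{4}-\\d", x, perl = TRUE))[[1]]
print(plain)
#[1] "1854-1" "1234-2" "5664-3"

# With lookarounds: the match inside the longer digit run is rejected
strict <- regmatches(x, gregexpr("(?<!\\d)\\d{4}-\\d(?!\\d)", x, perl = TRUE))[[1]]
print(strict)
#[1] "1854-1" "1234-2"
```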

Upvotes: 2
