Jlopes

Reputation: 146

A regex to split a text string in R

I have a very long string like the sample below, and I'm struggling to find a regex that splits it into parts according to a pattern such as '1. OAS / AC' and '2. OAS / AD'.

This slice of text has:

1) a varying number in the beginning

2) two capital letters varying from A to Z

I tried this:

x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

but it does not work.

Thanks in advance, for any help!

Example

require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

want <- list(
         "1. OAS / AC " = "12345/this is a test string to regex,",
         "2. OAS / AD " = "79856/this is another test string to regex,",
         "3. OAS / AE " = "87987/this is a new test string to regex.",
         "4. OAS / AZ " = "78798456/this is one mode test string to regex."
)

Upvotes: 1

Views: 59

Answers (3)

SKD

Reputation: 68

The way you described the issue is somewhat unclear, but if you simply want to extract everything up to "OAS / AC",

library(qdap)
beg2char(have, " ", 4)  # looks for the fourth occurrence of " " and extracts everything before it

For the above function to work on each sentence, the sentences should be individual strings in a character vector.

If your aim is to actually insert an "=" sign between the two letter sub-string and the number occurring after "OAS",

gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", have, perl = TRUE)

Upvotes: 0

Wiktor Stribiżew

Reputation: 627536

You may use

library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]

Result:

dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,", 
#  "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
#  ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
#  ))

See the regex demo.

Pattern details

  • (\d+\. OAS / [A-Z]{2}) - Capturing group 1:
    • \d+ - 1+ digits
    • \. - a .
    • OAS / - a literal OAS / substring
    • [A-Z]{2} - two uppercase letters
  • \s* - 0+ whitespaces
  • (.*?) - Capturing group 2: any 0+ chars other than line break chars, as few as possible
  • (?=\s*\d+\. OAS / [A-Z]{2}|\z) - a positive lookahead: immediately to the right of the current location, there must be either
    • \s*\d+\. OAS / [A-Z]{2} - 0+ whitespaces, 1+ digits, ., space, OAS, space, /, space, two uppercase letters
    • | - or
    • \z - end of string.
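
For comparison, here is a minimal base R sketch of the same idea, assuming the headers always look like the sample ("digits, dot, OAS / two capitals"). It is not part of the answer above; the names hdr, keys, vals and want are illustrative only. It finds the headers with gregexpr() and takes the text between them with regmatches(..., invert = TRUE).

```r
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."

# Header pattern (assumption: same shape as in the sample string)
hdr <- "\\d+\\. OAS / [A-Z]{2}"

# Locate every header in the string
m <- gregexpr(hdr, have, perl = TRUE)

# The matched headers become the names
keys <- regmatches(have, m)[[1]]

# invert = TRUE returns the text between matches; the first piece is the
# (empty) text before the first header, so it is dropped
vals <- regmatches(have, m, invert = TRUE)[[1]][-1]
vals <- trimws(vals)

# Named list, close to the `want` object in the question
want <- as.list(setNames(vals, keys))
```

Note that the names here carry no trailing space, unlike the hand-written want in the question.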

Upvotes: 0

Mako212

Reputation: 7312

We could do this with a positive lookahead, looking for the pattern of a number followed by a period:

str_split(have, "(?=\\d+\\.)")

[1] ""                                                             "1. OAS / AC 12345/this is a test string to regex, "          
[3] "2. OAS / AD     79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "      
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."

And we can further clean it up:

str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]

[1] "1. OAS / AC 12345/this is a test string to regex, "           "2. OAS / AD     79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. "       "4. OAS / AZ 78798456/this is one mode test string to regex." 
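
If the named list from the question is the goal, each piece can then be separated into its header and its text. The following is only a sketch in base R: pieces is typed in by hand to mirror the output shown above, and piece_re, keys, vals and want are illustrative names, not part of the answer.

```r
# Pieces as printed above (hard-coded here so the sketch runs on its own)
pieces <- c(
  "1. OAS / AC 12345/this is a test string to regex, ",
  "2. OAS / AD     79856/this is another test string to regex, ",
  "3. OAS / AE 87987/this is a new test string to regex. ",
  "4. OAS / AZ 78798456/this is one mode test string to regex."
)

# Group 1: the header; group 2: the remaining text without surrounding spaces
piece_re <- "^(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)\\s*$"

keys <- sub(piece_re, "\\1", pieces, perl = TRUE)
vals <- sub(piece_re, "\\2", pieces, perl = TRUE)

# Named list, one element per header
want <- as.list(setNames(vals, keys))
```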

Upvotes: 1
