Reputation: 146
I have a very long string like the sample below, and I'm struggling to find a regex that splits it into parts according to a pattern, for example: '1. OAS / AC' and '2. OAS / AD'.
This slice of text has:
1) a varying number in the beginning
2) two capital letters varying from A to Z
I tried this:
x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
but it does not work.
Thanks in advance, for any help!
Example
require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
want <- list(
"1. OAS / AC " = "12345/this is a test string to regex,",
"2. OAS / AD " = "79856/this is another test string to regex,",
"3. OAS / AE " = "87987/this is a new test string to regex.",
"4. OAS / AZ " = "78798456/this is one mode test string to regex."
)
Upvotes: 1
Views: 59
Reputation: 68
The way you described the issue is somewhat unclear, but if you simply want to extract everything up to "OAS / AC":
library(qdap)
beg2char(have, " ", 4)  # looks for the fourth occurrence of a space and extracts everything before it
For the above function to work, the sentences should be individual strings in a character vector.
If your aim is actually to insert an "=" sign between the two-letter substring and the number that follows it:
gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", have, perl = TRUE)
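To make the effect of that substitution concrete, here is a minimal sketch on a shortened input. The variable name `sample_text` is made up for the illustration; the pattern is the same one as above, and it only fires where an uppercase letter is followed (after optional whitespace) by a digit.

```r
# Shortened, hypothetical input with a single "AC 12345" boundary
sample_text <- "1. OAS / AC 12345/this is a test"

# Insert " = " between the last uppercase letter and the first digit after it
out <- gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", sample_text, perl = TRUE)
print(out)
```

Note that any other uppercase letter directly followed by a digit elsewhere in the text would be rewritten too, so this is only safe if the free text never contains such a sequence.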
Upvotes: 0
Reputation: 627536
You may use
library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]
Result:
dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,",
# "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
# ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
# ))
See the regex demo.
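If stringr is not available, roughly the same result can be sketched in base R with the same pattern and `perl = TRUE`. This is only an illustration on a shortened two-item input; the names `m`, `keys`, `vals` and `want` are made up here, and `as.list()` is added because the question asks for a list rather than a named vector.

```r
# Shortened sample with two items (assumption: same layout as in the question)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex."

pattern <- "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)"

# Full matches: each element is one "header + text" chunk
m <- regmatches(have, gregexpr(pattern, have, perl = TRUE))[[1]]

# Split each chunk into its header (name) and the text that follows it
keys <- sub("^(\\d+\\. OAS / [A-Z]{2}).*$", "\\1", m, perl = TRUE)
vals <- sub("^\\d+\\. OAS / [A-Z]{2}\\s*", "", m, perl = TRUE)

# Named list, one element per header
want <- as.list(setNames(vals, keys))
str(want)
```

Unlike the `want` in the question, the names here carry no trailing space, because the header group stops at the two uppercase letters.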
Pattern details
(\d+\. OAS / [A-Z]{2}) - Capturing group 1:
    \d+ - 1+ digits
    \. - a .
     OAS /  - a literal " OAS / " substring
    [A-Z]{2} - two uppercase letters
\s* - 0+ whitespaces
(.*?) - Capturing group 2: any 0+ chars other than line break chars, as few as possible
(?=\s*\d+\. OAS / [A-Z]{2}|\z) - a positive lookahead: immediately to the right of the current location, there must be
    \s*\d+\. OAS / [A-Z]{2} - 0+ whitespaces, 1+ digits, ., space, OAS, space, /, space, two uppercase letters
    | - or
    \z - end of string.
Upvotes: 0
Reputation: 7312
We could do this with a positive lookahead, looking for the pattern of a number followed by a period:
str_split(have, "(?=\\d+\\.)")
[1] "" "1. OAS / AC 12345/this is a test string to regex, "
[3] "2. OAS / AD 79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."
And we can further clean it up:
str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]
[1] "1. OAS / AC 12345/this is a test string to regex, " "2. OAS / AD 79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. " "4. OAS / AZ 78798456/this is one mode test string to regex."
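One caveat, as far as I can tell: a bare lookahead for digits plus a period can also fire inside a multi-digit item number (before the "0" in "10."), and on any "digit." that happens to appear in the free text. A more conservative sketch is to locate the full "N. OAS / XX" header and cut the string at those positions. The example below is base R on a made-up two-item input; `header`, `starts`, `ends` and `parts` are names invented for the illustration.

```r
# Hypothetical input that includes a two-digit item number
have <- "9. OAS / AC 12345/first text, 10. OAS / AD 79856/second text."

# Full header pattern: number, period, " OAS / ", two uppercase letters
header <- "\\d+\\. OAS / [A-Z]{2}"

# Start position of every header (this is -1 if nothing matches; not handled here)
starts <- as.integer(gregexpr(header, have, perl = TRUE)[[1]])

# Each part runs up to the character before the next header, or to the end
ends <- c(starts[-1] - 1L, nchar(have))

parts <- trimws(substring(have, starts, ends))
print(parts)
```

Because the cut points come from the whole header rather than from any digit followed by a period, "10." stays in one piece and no empty leading element has to be dropped.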
Upvotes: 1