Reputation: 10812
I have a long unstructured text from which I need to extract groups of text.
Each group has a clear start and end marker.
This is a truncated example of the unstructured text:
more useless gibberish at the beginning...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
bunch of text with lots of newlines in between... Closing 11.11 1,111.11 111,111.11
more useless gibberish between the groups...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
The word START appears in the middle sometimes multiple times, but it's fine bunch of text with lots of newlines in between... Closing 22.22 2,222.22 222,222.22
more useless gibberish at the end...
separated by new lines...
In the example above, I want to extract the 2 groups of text that lie between START and Closing.
I have successfully done so using this regex:
/(?<=START)(?s)(.*?)(?=Closing)/g
This is the result: https://regex101.com/r/vo7CLx/1/
Unfortunately, I also need to extract the rest of the line containing the Closing string.
If you look at the regex101 link, there's a Closing 11.11 1,111.11 111,111.11 in the first group and a Closing 22.22 2,222.22 222,222.22 in the second group, neither of which the regex matches.
Is there a way to do this in a single regex, so that even the ending tag with the numbers is included?
Upvotes: 1
Views: 5093
Reputation: 2029
You can try this regex:
START(.*)Closing(.*)(((.?\d{1,3})+.\d+)+.\d+.\d+.\d)\d
Upvotes: 0
Reputation: 10360
Try this Regex:
(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)
Explanation:
(?s) - single-line modifier, which means a . in the regex will also match a newline
(?<=START) - positive lookbehind to find the position immediately preceded by START
(.*?Closing(?:\s*[\d.,])+) - matches 0+ occurrences of any character lazily until the next occurrence of the word Closing which is followed by the sequence (?:\s*[\d.,])+
(?:\s*[\d.,])+ - matches 0+ whitespace characters followed by a digit, a . or a , and the + at the end means this sub-pattern has to match 1 or more times
Upvotes: 2
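For anyone who wants to try the pattern outside regex101, here is a minimal Python sketch. The sample text is modelled on the question, but its exact line breaks are an assumption. The names TEXT, PATTERN, LINE_PATTERN and extract_groups are made up for illustration. LINE_PATTERN is a hypothetical variant, not something proposed in the thread: it simply takes everything up to the end of the Closing line.

```python
import re

# Sample modelled on the question; the exact line breaks are an assumption.
TEXT = """\
more useless gibberish at the beginning...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
bunch of text with lots of newlines in between...
Closing 11.11 1,111.11 111,111.11
more useless gibberish between the groups...
separated by new lines...
START Fund Class Fund Number Fund Currency
XYZ XYZ XYZ USD
The word START appears in the middle sometimes multiple times, but it's fine
bunch of text with lots of newlines in between...
Closing 22.22 2,222.22 222,222.22
more useless gibberish at the end...
separated by new lines...
"""

# Pattern from the answer, unchanged: lazy match up to Closing,
# then one or more digits/dots/commas, each optionally preceded by whitespace.
PATTERN = re.compile(r"(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)")

# Hypothetical variant: keep the rest of the Closing line, whatever it contains.
LINE_PATTERN = re.compile(r"(?s)(?<=START)(.*?Closing[^\n]*)")


def extract_groups(text, pattern=PATTERN):
    """Return the captured text of every START ... Closing block."""
    return [m.group(1) for m in pattern.finditer(text)]


groups = extract_groups(TEXT)
line_groups = extract_groups(TEXT, LINE_PATTERN)

for g in groups:
    # Show only the last line of each captured group.
    print(g.splitlines()[-1])
```

One thing to watch: because \s* also matches newlines, the answer's pattern would keep going if the line right after Closing started with a digit, a dot or a comma. The line-based variant avoids that, but it also keeps any non-numeric text on the Closing line.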