Kim Stacks

Reputation: 10812

Regex to extract between start and end strings and match the entire line containing the end string

Problem

I have a long block of unstructured text from which I need to extract groups of text.

I have identified ideal start and end strings.

This is a truncated example of the unstructured text:

more useless gibberish at the begininng...
separated by new lines...
START                                              Fund Class                                            Fund Number                                   Fund Currency
XYZ                                      XYZ                                           XYZ                                          USD

                                                                                                                                                                bunch of text with lots of newlines in between...                                              Closing                              11.11                                                1,111.11   111,111.11

more useless gibberish between the groups...
separated by new lines...

START                                              Fund Class                                            Fund Number                                   Fund Currency
XYZ                                      XYZ                                           XYZ                                          USD

The word START appears in the middle sometimes multiple times, but it's fine                                                                                                                                                             bunch of text with lots of newlines in between...                                              Closing                              22.22                                                2,222.22   222,222.22

more useless gibberish at the end...
separated by new lines...

What I have tried

In the example above, I want to extract the 2 groups of text that lie between START and Closing.

I have successfully done so using this regex:

/(?<=START)(?s)(.*?)(?=Closing)/g

This is the result: https://regex101.com/r/vo7CLx/1/
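For anyone reproducing this outside regex101, here is a minimal Python sketch of the same pattern. The sample text is a shortened stand-in for the real input (an assumption, not the actual data), and the s flag is passed as re.DOTALL instead of inline, because Python 3.11 and later reject (?s) in the middle of a pattern.

```python
import re

# Shortened stand-in for the real input (assumed, not the actual data).
text = """gibberish at the beginning
START    Fund Class    Fund Number    Fund Currency
XYZ      XYZ           XYZ            USD

bunch of text    Closing    11.11    1,111.11   111,111.11

gibberish between the groups
START    Fund Class    Fund Number    Fund Currency
XYZ      XYZ           XYZ            USD

The word START appears in the middle    Closing    22.22    2,222.22   222,222.22

gibberish at the end
"""

# Lookbehind/lookahead version from the question; re.DOTALL plays the role of (?s).
pattern = re.compile(r"(?<=START)(.*?)(?=Closing)", re.DOTALL)
matches = pattern.findall(text)

# Each match stops right before "Closing", so the numbers on that line are lost.
for m in matches:
    print(repr(m[-25:]))
```

Both groups are found, but neither includes the Closing line itself, which is the gap described next.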

What's wrong?

Unfortunately, I also need to extract the rest of the line containing the Closing string.

If you look at the regex101 link, there's a Closing 11.11 1,111.11 111,111.11 right after the first match and a Closing 22.22 2,222.22 222,222.22 right after the second match, which the regex does not match.

Is there a way to do this in a single regex, so that even the ending tag with the numbers is included?

Upvotes: 1

Views: 5093

Answers (3)

Usman

Reputation: 2029

You can try this regex:

START(.*)Closing(.*)(((.?\d{1,3})+.\d+)+.\d+.\d+.\d)\d


Upvotes: 0

Gurmanjot Singh

Reputation: 10360

Try this Regex:

(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)


Explanation:

  • (?s) - single-line modifier, which means a . in the regex will also match a newline
  • (?<=START) - positive lookbehind to find the position immediately preceded by START
  • (.*?Closing(?:\s*[\d.,])+) - matches 0+ occurrences of any character lazily until the next occurrence of the word Closing, which must be followed by the sequence (?:\s*[\d.,])+
    • (?:\s*[\d.,])+ - matches 0+ whitespace characters followed by a digit, a . or a ,. The + at the end means this sub-pattern must match 1 or more times
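As a rough check of the explanation above, this pattern can be run in Python. The sample text is a shortened, assumed stand-in for the real input. Note that \s also matches newlines, so if the line after the Closing line happened to begin with a digit, dot or comma, the match would likely run on into it.

```python
import re

# Shortened stand-in for the real input (assumed, not the actual data).
text = """gibberish at the beginning
START    Fund Class    Fund Number    Fund Currency
XYZ      XYZ           XYZ            USD

bunch of text    Closing    11.11    1,111.11   111,111.11

gibberish between the groups
START    Fund Class    Fund Number    Fund Currency
XYZ      XYZ           XYZ            USD

The word START appears in the middle    Closing    22.22    2,222.22   222,222.22

gibberish at the end
"""

# (?s) is accepted by Python here because it sits at the very start of the pattern.
pattern = re.compile(r"(?s)(?<=START)(.*?Closing(?:\s*[\d.,])+)")
matches = pattern.findall(text)

# Each match now runs through the numbers that follow "Closing".
for m in matches:
    print(repr(m[-45:]))
```

The lookbehind keeps START itself out of the match; dropping the lookbehind and matching START directly would include it.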

Upvotes: 2

csabinho

Reputation: 1609

(START)(?s)(.*?)(Closing)(\s+((,?\d{1,3})+.\d+))+ should match everything you want.
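A hedged Python adaptation of this pattern, for illustration only. Two changes are assumptions on my part: the inline (?s) is replaced by re.DOTALL, because Python 3.11 and later reject a global flag in the middle of a pattern, and the decimal point is escaped as \. since a literal dot seems to be intended. The sample text is a shortened stand-in for the real input.

```python
import re

# Shortened stand-in for the real input (assumed, not the actual data).
text = """gibberish at the beginning
START    Fund Class    Fund Number    Fund Currency
XYZ      XYZ           XYZ            USD

bunch of text    Closing    11.11    1,111.11   111,111.11

gibberish between the groups
START    Fund Class    Fund Number    Fund Currency
XYZ      XYZ           XYZ            USD

The word START appears in the middle    Closing    22.22    2,222.22   222,222.22

gibberish at the end
"""

# Adapted pattern: re.DOTALL instead of inline (?s), and \. for the decimal point.
pattern = re.compile(
    r"(START)(.*?)(Closing)(\s+((,?\d{1,3})+\.\d+))+",
    re.DOTALL,
)

# group(0) is the whole match, from START through the last number after Closing.
matches = [m.group(0) for m in pattern.finditer(text)]

for m in matches:
    print(repr(m[:20]), "...", repr(m[-25:]))
```

Unlike the lookbehind versions, the full match here includes the START word itself; group 2 holds only the text between START and Closing.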

Upvotes: 1
