Dennis

Reputation: 206

Regex - Capture multiple multiline text blocks with only a starting pattern

I have a very large text file with several entries like this:

    -------------------------------------
    
       LOTS OF
        MULTILINE
       TEXT
    
    *************************************
              MORE
       MULTILINE
         TEXT
    
    *************************************
    
       EVEN-MORE-TEXT-SOMETIMES-WITH-DASHES
    
    *************************************

    -------------------------------------

       2ND LOT OF
        MULTILINE
       TEXT
    
    *************************************
      MORE
       MULTILINE
         TEXT FOR 2ND LOT
    
    *************************************
    
       EVEN-MORE-TEXT-FOR-2ND

    *************************************

Note that these are only two entries. I don't care about the asterisk lines; what I care about is the text that follows each dashed line.

I want to get a capture group with all the text in each entry so that I can analyze it later line by line.

I can capture the first entry with an expression like this:

/-{37}\s*([\s\S]+)-{37}/gm

But I'm having trouble matching every entry because I don't have a clear terminator for the groups (the *{37} line appears several times within each entry).

Here's a regex 101 example:

https://regex101.com/r/XZQ5h6/1

How can I capture the text after the dashed line but before the next dashed line or the end of the file?

Edit: To make my question clearer, the capture group I would expect for the first entry would be:

       LOTS OF
        MULTILINE
       TEXT

    *************************************
              MORE
       MULTILINE
         TEXT

    *************************************

       EVEN-MORE-TEXT-SOMETIMES-WITH-DASHES

    *************************************

I also happen to have some dashes in the text, so I have edited the example. Ideally, I want an array of capture groups with just the content in the entries.

Upvotes: 2

Views: 159

Answers (3)

The fourth bird

Reputation: 163362

If you want to capture the 2 parts in the example data in a capture group:

^-{37}\s*^((?:(?!--).*(?:\r?\n|$))+)

In parts, the pattern matches:

  • ^ Start of a line (the pattern relies on the multiline flag)
  • -{37} Match the - char 37 times
  • \s*^ Match optional whitespace chars, then assert the start of a line (enable the multiline flag so the anchors match at line boundaries)
  • ( Capture group 1
    • (?: Non capture group
      • (?!--) Negative lookahead, assert that the line does not start with -- (or make it more specific like (?!-{37}\r?\n))
      • .*(?:\r?\n|$) Match the whole line followed by a newline, or assert the end of the line or string
    • )+ Close the non capture group and repeat it 1+ times
  • ) Close group 1

Regex demo
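
For anyone who wants to try this outside regex101, here is a minimal Python sketch of the pattern above. The sample text is a hypothetical stand-in shaped like the question's data, and Python's `re` module is my assumption (the question does not name a regex flavor). `re.MULTILINE` plays the role of the multiline flag.

```python
import re

SEP = "-" * 37    # dashed line that opens each entry
STARS = "*" * 37  # asterisk line that may repeat inside an entry

# Hypothetical sample shaped like the question's data.
text = "\n".join([
    SEP, "", "   LOTS OF", "    MULTILINE", "   TEXT", "", STARS, "",
    "   EVEN-MORE-TEXT-SOMETIMES-WITH-DASHES", "", STARS, "",
    SEP, "", "   2ND LOT OF", "   TEXT", "", STARS, "",
])

# re.MULTILINE lets ^ and $ match at every line boundary.
pattern = re.compile(r"^-{37}\s*^((?:(?!--).*(?:\r?\n|$))+)", re.MULTILINE)

# With a single capture group, findall returns one string per entry.
entries = pattern.findall(text)

for number, entry in enumerate(entries, start=1):
    # Analyze the entry line by line, skipping blank lines.
    lines = [line.strip() for line in entry.splitlines() if line.strip()]
    print(number, lines)
```

Because the lookahead is only tested at the start of each line, single dashes inside a line such as EVEN-MORE-TEXT-SOMETIMES-WITH-DASHES do not end the entry.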

Or a bit shorter, but then the capture group would include the leading newlines and whitespace chars:

^-{37}((?:\r?\n(?!--).*)*)

Regex demo
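
A small Python sketch of the shorter variant, again on a hypothetical sample and assuming the `re` module. It shows that each group starts with the line breaks that follow the dashed line, which can be stripped afterwards if they are not wanted.

```python
import re

SEP = "-" * 37
STARS = "*" * 37

# Hypothetical sample with two entries.
text = "\n".join([SEP, "", "   LOTS OF", "   TEXT", "", STARS, "",
                  SEP, "", "   2ND LOT OF", "", STARS, ""])

short = re.compile(r"^-{37}((?:\r?\n(?!--).*)*)", re.MULTILINE)

# Each group begins with the newlines that follow the dashed line.
raw = short.findall(text)

# Drop only leading/trailing line breaks; indentation inside the entry stays.
cleaned = [entry.strip("\r\n") for entry in raw]
print(cleaned)
```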

Upvotes: 0

DigitShifter

Reputation: 854

This regex will match both entries:

/-{37}[^-]+/gm

Try it out in regex101.
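
One thing to check against your own data: `[^-]+` cannot cross any hyphen, so text that contains dashes (as in the edited question) cuts the match short. A minimal Python sketch on a hypothetical sample, assuming the `re` module:

```python
import re

SEP = "-" * 37
STARS = "*" * 37

# Hypothetical sample; the first entry contains hyphenated text.
text = "\n".join([SEP, "", "   LOTS OF", "   TEXT", "", STARS, "",
                  "   EVEN-MORE-TEXT-SOMETIMES-WITH-DASHES", "", STARS, "",
                  SEP, "", "   2ND LOT OF", "", STARS, ""])

# There is no capture group, so every match begins with the dashed line itself.
matches = re.findall(r"-{37}[^-]+", text)

# Remove the dashed line and surrounding whitespace to inspect the body.
first_body = matches[0][len(SEP):].strip()
print(first_body)  # the body stops at "EVEN", before the first hyphen
```

If the entries never contain a hyphen, the pattern does find both of them.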

Upvotes: 0

anubhava

Reputation: 785196

You can use this regex:

-{37}\R+((?:.+\R)+)

RegEx Demo

RegEx Details:

  • -{37}: Match 37 hyphens
  • \R+: Match 1+ line breaks
  • (: Start capture group
    • (?:.+\R)+: Match a line of 1+ characters followed by a line break. Repeat this group 1+ times to match multiple such lines
  • ): End capture group
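
A rough Python adaptation, in case it helps. Python's built-in `re` module does not support `\R`, so this sketch substitutes `\r?\n` (my assumption, not part of the pattern above). Since `.+` needs at least one character, the capture ends at the first completely empty line; the hypothetical sample below therefore uses whitespace-only lines inside an entry and an empty line before the next dashed line.

```python
import re

SEP = "-" * 37
STARS = "*" * 37

# \r?\n stands in for \R, which Python's re module lacks.
pattern = re.compile(r"-{37}(?:\r?\n)+((?:.+\r?\n)+)")

# Hypothetical sample: "blank" lines inside an entry hold spaces,
# and a truly empty line separates one entry from the next.
spaced = "\n".join([SEP, "    ", "   LOTS OF", "    ", STARS, "    ",
                    "   EVEN-MORE-TEXT", "    ", STARS, "",
                    SEP, "", "   2ND LOT OF", "    ", STARS, ""])
whole = pattern.findall(spaced)
print(len(whole))

# With a truly empty line inside the entry, the capture stops there.
truncated = pattern.findall(SEP + "\n\nA\n\nB\n")
print(truncated)
```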

Upvotes: 1
