user134611

Reputation: 776

Regex for matching text between two regex patterns

I am looking for a way to capture text and its paragraph title from a text document.

Text File:

paraTitle-1
--------
Lines and words
empty....
more lines



still part of paraTitle-1

paraTitle-2
--------
Lines and words
empty....
more lines



still part of paraTitle-2

I want to capture both the titles and the text below them.

 array = [paraTitle-1: <text below paraTitle-1>,
          paraTitle-2: <text below paraTitle-2>]

I made a few attempts with the pattern (?<=(.*))\n----*\n(?=(.*)), to no avail. Any guidance would be awesome.

Upvotes: 0

Views: 45

Answers (1)

Andreas

Reputation: 159086

The following regex will do:

(?!--------\R)(.*)\R--------\R((?:\R?(?!.*\R--------\R).*)+)

See regex101.

The title separator line (--------) can also be written as -{8}, which is easier to adapt to a variable length if needed. For example, to accept 6 or more dashes instead of exactly 8, use -{6,}. The separator appears three times in the regex, so change all three occurrences.
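
As a quick illustration of the two quantifier forms, here is a minimal Python sketch that tests single lines against them. The names EXACT, AT_LEAST and is_separator are made up for this example and are not part of the regex above.

```python
import re

# Hypothetical names, for illustration only.
EXACT = re.compile(r"-{8}")      # exactly 8 dashes
AT_LEAST = re.compile(r"-{6,}")  # 6 or more dashes


def is_separator(line, pattern=AT_LEAST):
    """Return True if the whole line consists of an accepted run of dashes."""
    return pattern.fullmatch(line) is not None


if __name__ == "__main__":
    for candidate in ("-----", "------", "--------", "----------"):
        print(repr(candidate), is_separator(candidate), is_separator(candidate, EXACT))
```

fullmatch is used so that a longer run of dashes is not accepted by -{8} just because it contains 8 dashes somewhere.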

Explanation:

  • Capture a line of text (paragraph title):

    (.*)\R
    
    • The . doesn't match line break characters
    • \R matches line breaks, including the Windows CRLF pair. If your regex engine doesn't support \R, use \r?\n as a simple alternative.
  • Make sure the captured text is not the title separator line:

    (?!--------\R)
    
  • Skip the mandatory title separator line:

    --------\R
    
  • Capture the paragraph text, as a repeating group of lines:

    ((?:xxx)+)
    
  • A line has an optional leading line break (first line doesn't have one):

    \R?.*
    
  • But make sure the line is not the title of the next paragraph, i.e. it's not a line followed by the title separator line:

    (?!.*\R--------\R)
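
Putting the pieces together, here is a minimal sketch of how the regex could be applied to build the title-to-text mapping from the question. It is written in Python purely as an illustration. Python's re module does not support \R, so \r?\n is substituted as suggested above, and -{8} stands in for the literal dashes. The names SECTION and parse_sections are hypothetical.

```python
import re

# Same structure as the regex above, with \R replaced by \r?\n
# and the separator written as -{8}.
SECTION = re.compile(
    r"(?!-{8}\r?\n)(.*)\r?\n"                      # title line (must not be a separator)
    r"-{8}\r?\n"                                   # mandatory separator line
    r"((?:(?:\r?\n)?(?!.*\r?\n-{8}\r?\n).*)+)"     # body lines up to the next title
)


def parse_sections(text):
    """Map each paragraph title to the text below it.

    Assumption: titles are unique; a repeated title would overwrite
    the earlier entry in the dict.
    """
    sections = {}
    for match in SECTION.finditer(text):
        # In Python, '.' excludes only '\n', so with CRLF input a stray
        # '\r' can end up in a capture; strip() removes it along with
        # the blank lines surrounding the body.
        title = match.group(1).strip()
        sections[title] = match.group(2).strip()
    return sections


if __name__ == "__main__":
    sample = (
        "paraTitle-1\n--------\nLines and words\nempty....\nmore lines\n\n\n\n"
        "still part of paraTitle-1\n\n"
        "paraTitle-2\n--------\nLines and words\nempty....\nmore lines\n\n\n\n"
        "still part of paraTitle-2\n"
    )
    for title, body in parse_sections(sample).items():
        print(title, "=>", repr(body))
```

Blank lines inside a paragraph are kept because the body only stops at a line that is directly followed by the separator. The strip() calls are a choice made for this sketch, not something the regex does itself.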
    

Upvotes: 1
