Reputation: 9299

Multiline RegEx to match YAML Frontmatter, only the first match, only when preceded by nothing other than space

Problem

The problem description is simple: I have a pile of text files from which I wish to extract the frontmatter (described anon) alone, if it's there at all, and then stop processing the file any further.

Here's a sample valid example of a file with frontmatter; my comments (assume invisible from the file) will be in c-style comments:


      /*spaces & newlines are fine*/

---     /* i.e., /^---\s*$/ */
key: value
foo: bar, zip, grump
/*
Anything can go in here; once I have this section pulled out, the yaml schema
can do the rest.  All that's important to note is that this section must be
terminated explicitly with a subsequent /^---\s*$/ in order to be deemed valid.
*/
---

Anything else can follow here, more accidental frontmatter blobs can exist,
but it should not matter since the other requirement is that the regex engine
will cease processing beyond the termination of the first match.

What I have so far, which doesn't address certain edge cases, is this, using ripgrep/rg:

rg -g '!**/{node_modules,.*}/*' -g '*.md' -U '(?s)\s*^---$((?!---).*)^---$' -r '$1'

The problem with the above right now is that it matches far past the first terminating --- in certain cases, for example where you have two frontmatter blobs, one after another.

Bonus Problem

I want to know how I can do this with the standard regex engine that rg defaults to, but also how to do this with PCRE2 (-P)
I want to know how I can have all flags embedded in the regex itself, rather than have -U for multiline, using (?m) for example

Upvotes: 1

Answers (2)

The fourth bird

Reputation: 163342

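Your pattern (?s)\s*^---$((?!---).*)^---$ matches too much because (?s) makes the dot match a newline, and the greedy .* first matches all the way to the end of the input and then backtracks only as far as needed to fit in the ^---$ part, which is the last --- line rather than the first. The (?!---) is checked only once, right before the .*.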
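To make the overshoot concrete, here is a small sketch in Python's re module (not ripgrep; an assumption made only for illustration). The sample text is hypothetical, and (?m) is added to the question's pattern so that ^ and $ anchor at line boundaries in Python.

```python
import re
from typing import Optional

# Hypothetical sample: two frontmatter blobs, one after another.
SAMPLE = "---\na: 1\n---\nbody\n---\nb: 2\n---\n"

# The question's pattern, with (?m) added for Python's re module.
GREEDY = re.compile(r"(?ms)\s*^---$((?!---).*)^---$")


def greedy_capture(text: str) -> Optional[str]:
    """Return group 1 of the first match of the greedy pattern, or None."""
    m = GREEDY.search(text)
    return m.group(1) if m else None


# The capture runs up to the LAST --- line, swallowing the body and the
# second blob along the way.
print(repr(greedy_capture(SAMPLE)))
# '\na: 1\n---\nbody\n---\nb: 2\n'
```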

You could write the pattern using a tempered greedy token, repeating a non-capturing group inside a capture group in this case, but note that the (?!---) part would not allow any 3 consecutive hyphens anywhere in between, not just on a line of their own. As the leading whitespace chars are optional, you can omit them.

(?s)^---$((?:(?!---).)*)^---$
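A quick way to see both the fix and the caveat is the following sketch, again in Python's re module as a stand-in (with (?m) added for the anchors). Both sample strings are hypothetical.

```python
import re
from typing import Optional

# Tempered greedy token: the lookahead is re-checked before every character.
TEMPERED = re.compile(r"(?ms)^---$((?:(?!---).)*)^---$")

TWO_BLOBS = "---\na: 1\n---\nbody\n---\nb: 2\n---\n"
INLINE_HYPHENS = "---\ntitle: a --- b\n---\n"


def tempered_capture(text: str) -> Optional[str]:
    """Return group 1 of the first match of the tempered pattern, or None."""
    m = TEMPERED.search(text)
    return m.group(1) if m else None


# Stops at the first closing --- line.
print(repr(tempered_capture(TWO_BLOBS)))       # '\na: 1\n'

# Caveat: three hyphens inside a value stop the token early, so no match.
print(repr(tempered_capture(INLINE_HYPHENS)))  # None
```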


You could write the pattern without (?s) making use of a negative lookahead (maybe you have to use (?m) for multiline but I am not sure about that with ripgrep)

Using pcre and \R to match newlines:

^---((?:\R(?!---$).*)*)\R---$

Explanation

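^ Start of the string (or of a line, when multiline mode is on)
--- Match literally
( Capture group 1
  (?:\R(?!---$).*)* Repeat: match a newline, then the whole line if it is not exactly ---
) Close the capture group
\R---$ Match a unicode newline sequence, then ---, and assert the end of the string (or of the line, when multiline mode is on)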
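The same line-by-line idea can be tried in Python as a sketch. Python's re module has no \R, so this swaps in a plain \n (an assumption: the input uses Unix line endings) and adds (?m). The sample string is hypothetical.

```python
import re
from typing import Optional

# Line-by-line variant: consume a newline plus one whole line at a time,
# as long as that line is not exactly ---. \n stands in for PCRE's \R.
LINEWISE = re.compile(r"(?m)^---((?:\n(?!---$).*)*)\n---$")

SAMPLE = "---\ntitle: a --- b\nk: v\n---\nbody\n---\nx: 1\n---\n"


def linewise_capture(text: str) -> Optional[str]:
    """Return group 1 of the first match of the line-wise pattern, or None."""
    m = LINEWISE.search(text)
    return m.group(1) if m else None


# Hyphens inside a value are fine here; only a line that is exactly ---
# ends the block, and the match stops at the first such line.
print(repr(linewise_capture(SAMPLE)))  # '\ntitle: a --- b\nk: v'
```

Unlike the tempered token, this only rejects a --- that fills a whole line, and it needs neither (?s) nor a lazy quantifier.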


If you want the frontmatter as the full match rather than as a capture group, then using pcre you could also opt for \K to forget what is matched so far, and a possessive quantifier.

Using the lookahead at the end means that it will assert (not match) the trailing ---

^---\K(?:\R(?!---$).*)*+(?=\R---$)


Upvotes: 0

markalex

Reputation: 13351

To solve your main problem, I believe it is enough to make your quantifier lazy.

Also, negative lookahead is redundant here (and was used a little wrong, more on this at the end).

(?s)\s*^---$(.*?)^---$

I believe this regex should work for both pcre2 and default, since it doesn't use lookarounds. But I'm not entirely sure on default engine and (?s).
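Here is a sketch of the lazy pattern in Python's re module (with (?m) added for the anchors), not in rg itself. The second pattern, ANCHORED, is my own assumption and not part of the pattern above: it uses \A so that only frontmatter preceded by nothing but whitespace counts, and [ \t]* to tolerate trailing spaces after ---. All sample strings are hypothetical.

```python
import re
from typing import Optional

# The lazy pattern from this answer, with (?m) added for Python.
LAZY = re.compile(r"(?ms)\s*^---$(.*?)^---$")

# Assumed variant: \A pins the match to the top of the file.
ANCHORED = re.compile(r"(?ms)\A\s*^---[ \t]*\n(.*?)^---[ \t]*$")


def lazy_capture(text: str) -> Optional[str]:
    """First lazy match anywhere in the text, or None."""
    m = LAZY.search(text)
    return m.group(1) if m else None


def extract_frontmatter(text: str) -> Optional[str]:
    """Frontmatter body only if nothing but whitespace precedes it."""
    m = ANCHORED.match(text)
    return m.group(1) if m else None


TWO_BLOBS = "---\na: 1\n---\nbody\n---\nb: 2\n---\n"
PADDED = "  \n\n---  \nkey: value\nfoo: bar\n---\nrest\n---\nx: 1\n---\n"
NOT_AT_TOP = "hello\n---\na: 1\n---\n"

print(repr(lazy_capture(TWO_BLOBS)))          # '\na: 1\n'
print(repr(extract_frontmatter(PADDED)))      # 'key: value\nfoo: bar\n'
print(repr(lazy_capture(NOT_AT_TOP)))         # '\na: 1\n'
print(repr(extract_frontmatter(NOT_AT_TOP)))  # None
```

The last two lines show why the anchor matters for the "preceded by nothing other than space" requirement: a plain search still finds a block that sits below other text.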

As for -U, I believe it changes how rg reads and searches the file (across lines instead of one line at a time), so it's quite unlikely that you could replace it with a flag inside the regex.
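A rough Python analogy (not how ripgrep is implemented, just an illustration of the idea): if the pattern is only ever shown one line at a time, no inline flag can make it span lines, because the other lines are never part of the input. The function names and sample text are hypothetical.

```python
import re
from typing import List

PATTERN = re.compile(r"(?ms)^---$(.*?)^---$")

TEXT = "---\na: 1\n---\nbody\n"


def search_per_line(text: str) -> List[str]:
    """Apply the pattern to each line separately (line-oriented search)."""
    return [
        m.group(1)
        for line in text.splitlines(keepends=True)
        for m in PATTERN.finditer(line)
    ]


def search_whole(text: str) -> List[str]:
    """Apply the pattern to the whole text at once (multiline search)."""
    return [m.group(1) for m in PATTERN.finditer(text)]


print(search_per_line(TEXT))  # []
print(search_whole(TEXT))     # ['\na: 1\n']
```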

Negative lookahead

It looks like you've tried to disallow any appearance of --- in the matched block. If this is the case, it should be done with a construction like ((?!---).)*, which repeats the lookahead before every character instead of checking it once.
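The difference between the two placements of the lookahead is easy to check in isolation. A minimal sketch in Python's re module, with hypothetical helper names and inputs:

```python
import re
from typing import Optional


def once_at_start(text: str) -> Optional[str]:
    """(?!---).* tests the lookahead a single time, at the start position."""
    m = re.match(r"(?!---).*", text)
    return m.group() if m else None


def at_every_char(text: str) -> Optional[str]:
    """(?:(?!---).)* tests the lookahead again before each character."""
    m = re.match(r"(?:(?!---).)*", text)
    return m.group() if m else None


print(repr(once_at_start("ab---cd")))  # 'ab---cd'
print(repr(at_every_char("ab---cd")))  # 'ab'
print(repr(once_at_start("---x")))     # None
```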

Upvotes: 0
