Reputation: 9299
The problem description is simple: I have a pile of text files from which I wish to extract the frontmatter (described anon) alone, if it's there at all, and then stop processing the file any further.
Here's a sample of a valid file with frontmatter; my comments (assume they are not in the file) are in C-style comments:
/*spaces & newlines are fine*/
--- /* i.e., /^---\s*$/ */
key: value
foo: bar, zip, grump
/*
Anything can go in here; once I have this section pulled out, the yaml schema
can do the rest. All that's important to note is that this section must be
terminated explicitly with a subsequent /^---\s*$/ in order to be deemed valid.
*/
---
Anything else can follow here, more accidental frontmatter blobs can exist,
but it should not matter since the other requirement is that the regex engine
will cease processing beyond the termination of the first match.
What I have so far, which doesn't address certain edge cases, is the following, using ripgrep/rg:
rg -g '!**/{node_modules,.*}/*' -g '*.md' -U '(?s)\s*^---$((?!---).*)^---$' -r '$1'
The problem with the above right now is that it matches far past the first terminating ---
in certain cases, for example where you have two frontmatter blobs, one after another.
rg defaults to, but also how to do this with PCRE2 (-P), -U for multiline, using (?m) for example
Upvotes: 1
Views: 211
Reputation: 163342
Your pattern (?s)\s*^---$((?!---).*)^---$ matches too much because you use (?s) to have the dot match a newline, and you use .* which first matches all the way to the end and then backtracks to fit in the ^---$ part.
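To see the over-matching concretely, here is a minimal Python sketch. It assumes Python's re module backtracks the same way for this pattern; (?m) is added because Python does not treat ^ and $ as line anchors by default, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample: two frontmatter-like blobs, one after another.
SAMPLE = "---\na: 1\n---\nbody\n---\nb: 2\n---\nrest\n"

# The pattern from the question; (?m) is added so that ^ and $ match at
# line boundaries in Python.
GREEDY = re.compile(r"(?ms)\s*^---$((?!---).*)^---$")

match = GREEDY.search(SAMPLE)
greedy_capture = match.group(1)

# The greedy .* runs to the end of the text and backtracks only as far as
# the LAST line that is exactly ---, so the capture spans both blobs.
print(repr(greedy_capture))
```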
You could write the pattern using a tempered greedy token, repeating a non-capturing group inside a capture group in this case, but note that this part, (?!---), would not allow any 3 consecutive hyphens in between. As the leading whitespace chars are optional, you can omit them.
(?s)^---$((?:(?!---).)*)^---$
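A small Python sketch of the tempered greedy token and its caveat, under the assumption that Python's re is an acceptable stand-in for a lookahead-capable engine ((?m) is added for Python; both sample strings are hypothetical):

```python
import re

# Tempered greedy token: the lookahead is re-checked before every character.
TEMPERED = re.compile(r"(?ms)^---$((?:(?!---).)*)^---$")

# Two blobs in a row: the capture now stops at the first terminating ---.
two_blobs = "---\na: 1\n---\nbody\n---\nb: 2\n---\nrest\n"
first_capture = TEMPERED.search(two_blobs).group(1)
print(repr(first_capture))

# Caveat: three consecutive hyphens inside a value block the match entirely.
hyphens_inside = "---\ntitle: a---b\n---\n"
blocked = TEMPERED.search(hyphens_inside)
print(blocked)
```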
You could write the pattern without (?s), making use of a negative lookahead (maybe you have to use (?m) for multiline, but I am not sure about that with ripgrep).
Using pcre and \R to match newlines:
^---((?:\R(?!---$).*)*)\R---$
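The line-by-line idea can be tried outside PCRE as well. The sketch below is an approximation, not the pattern above verbatim: Python's re has no \R, so \r?\n is substituted, and the sample assumes LF line endings. The input string is hypothetical.

```python
import re

# Approximation of the \R pattern: \R is swapped for \r?\n because
# Python's re module does not support \R.
LINEWISE = re.compile(r"(?m)^---((?:\r?\n(?!---$).*)*)\r?\n---$")

# Hyphens inside a value are fine here, because only a line that is
# exactly --- ends the block.
sample = "---\ntitle: a---b\nfoo: bar\n---\nbody\n---\nx: 1\n---\n"

linewise_capture = LINEWISE.search(sample).group(1)
print(repr(linewise_capture))
```

Unlike the tempered token, the lookahead runs once per line rather than once per character.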
Explanation
^  Start of string
---  Match literally
(  Capture group 1
(?:\R(?!---$).*)*  Match the whole line if it is not ---
)  Close the capture group
\R---$  Match a unicode newline sequence, then ---, and assert the end of the string
If you want a match only using pcre, you could also opt for \K to forget what is matched so far, and a possessive quantifier.
Using the lookahead at the end means that it will assert (not match) the trailing ---
^---\K(?:\R(?!---$).*)*+(?=\R---$)
Upvotes: 0
Reputation: 13351
To solve your main problem, I believe it is enough to make your quantifier lazy.
Also, the negative lookahead is redundant here (and was used slightly incorrectly; more on this at the end).
(?s)\s*^---$(.*?)^---$
I believe this regex should work for both pcre2 and the default engine, since it doesn't use lookarounds. But I'm not entirely sure about the default engine and (?s).
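As a rough check of the lazy version, here is a Python sketch. The helper name first_frontmatter is hypothetical, (?m) is added because Python needs it for line anchors, and re.search is used so that only the first match is considered.

```python
import re

# Lazy version of the pattern; (?m) is added for Python's line anchors.
LAZY = re.compile(r"(?ms)\s*^---$(.*?)^---$")


def first_frontmatter(text):
    """Return the body of the first ----delimited block, or None."""
    match = LAZY.search(text)  # search() stops at the first match
    return match.group(1) if match else None


# Hypothetical inputs: two blobs, leading whitespace, and a missing terminator.
print(repr(first_frontmatter("---\na: 1\n---\nbody\n---\nb: 2\n---\nrest\n")))
print(repr(first_frontmatter("\n  \n---\nk: v\n---\n")))
print(first_frontmatter("---\na: 1\n"))
```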
As for -U, I believe it changes how the app reads the file, so it's quite unlikely that you could drop it.
It looks like you've tried to disallow any appearance of --- in the matched block. If this is the case, it should be done with a construction like: ((?!---).)*
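The difference between the two forms can be illustrated with a short Python sketch (the test string is made up): a single lookahead in front of .* is checked only once, at the position where .* starts, while the grouped form re-checks it before every character.

```python
import re

# Lookahead checked once, before .* starts consuming.
ONCE = re.compile(r"(?s)(?!---).*")
# Lookahead checked before each consumed character.
EACH = re.compile(r"(?s)(?:(?!---).)*")

text = "a---b"

# ONCE accepts the whole string: the text does not START with ---.
once_ok = ONCE.fullmatch(text) is not None
# EACH cannot get past the --- in the middle, so the full match fails.
each_ok = EACH.fullmatch(text) is not None

print(once_ok, each_ok)
```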
Upvotes: 0