GhostOrder

Reputation: 663

Regex - match a pattern only if it comes after or before a certain pattern

I have a giant string (Markdown) that contains something like this:

## Header 1

{~1.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~2.0} vitae congue erat accumsan nec. {~3.0}

{~4.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~5.0} vitae congue erat accumsan nec. {~6.0}

{~7.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~8.0} vitae congue erat accumsan nec. {~9.0}

## Header 2

{~10.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~11.0} vitae congue erat accumsan nec. {~12.0}

{~113.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~14.0} vitae congue erat accumsan nec. {~15.0}

{~16.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~17.0} vitae congue erat accumsan nec. {~18.0}

## Header 3

{~19.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~20.0} vitae congue erat accumsan nec. {~21.0}

{~22.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~23.0} vitae congue erat accumsan nec. {~24.0}

{~25.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~26.0} vitae congue erat accumsan nec. {~27.0}

A marker looks like this: {~x.x}

I will call the combination of a header and one or more paragraphs a "section".

I need to match the first and the last marker of every section.

Currently I'm using the regex /\s?{([^}]*(~\d*(?:\.\d+)?)[^}]*)}\s?/g in JavaScript, which I got from the accepted answer to a related question, to capture all the markers. Now I need to modify it so that it captures only the first and the last marker of every section.

The string comes from user input, so I cannot know in advance how many paragraphs a section will have, nor the content of the headers. All I know is that there will be at least one section (that is, one header followed by some number of paragraphs).

Upvotes: 1

Views: 81

Answers (2)

InSync

Reputation: 10783

This is possible with lookarounds, which JS supports.

Since we're reusing the original pattern a lot, let's store it in a variable:

const pattern = String.raw`{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)}`;

A string that doesn't contain the pattern above can be matched like this, where [^] matches any character, including newlines (similar to a . with the s flag):

`(?:(?!${pattern})[^])*`
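To see what that fragment does on its own, here is a minimal sketch. The `sample` string and the `noMarker` name are made up for illustration, and the fragment is anchored with `^` only to keep the demo deterministic:

```javascript
const pattern = String.raw`{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)}`;

// Consume characters one at a time, but only while no marker starts
// at the current position.
const noMarker = new RegExp(`^(?:(?!${pattern})[^])*`);

const sample = 'Lorem ipsum\ndolor {~2.0} sit amet';
const prefix = sample.match(noMarker)[0];

console.log(JSON.stringify(prefix)); // "Lorem ipsum\ndolor "
```

The match crosses the newline (because of `[^]`) and stops right before `{~2.0}`, which is the behaviour the lookarounds below rely on.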

From that, we construct our lookahead and lookbehind:

// Pattern, anything that doesn't contain pattern, then header or end of string (not end of line).
const lookahead = `${pattern}(?=(?:(?!${pattern})[^])*(?:^##.+|(?![^])))`;

// Header, anything that doesn't contain pattern, then pattern itself.
const lookbehind = `(?<=^##.+$(?:(?!${pattern})[^])*)${pattern}`;
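Each half can be checked separately before combining them. This is only a sketch on a shortened, made-up input (`sample`, `firsts` and `lasts` are illustrative names), assuming an engine with lookbehind support:

```javascript
const pattern = String.raw`{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)}`;
const lookahead = `${pattern}(?=(?:(?!${pattern})[^])*(?:^##.+|(?![^])))`;
const lookbehind = `(?<=^##.+$(?:(?!${pattern})[^])*)${pattern}`;

// Two small sections: A has three markers, B has two.
const sample = [
  '## A', '', '{~1.0} x {~2.0} y {~3.0}', '',
  '## B', '', '{~4.0} z {~5.0}',
].join('\n');

// The lookbehind half finds the first marker after each header.
const firsts = [...sample.matchAll(new RegExp(lookbehind, 'gm'))].map(m => m[0]);
// The lookahead half finds the last marker before the next header or the end.
const lasts = [...sample.matchAll(new RegExp(lookahead, 'gm'))].map(m => m[0]);

console.log(firsts); // [ '{~1.0}', '{~4.0}' ]
console.log(lasts);  // [ '{~3.0}', '{~5.0}' ]
```

The `m` flag matters here: it lets `^` and `$` match at line boundaries, so `^##.+` can find a header in the middle of the string.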

Here's how our final steps go:

const regex = new RegExp(`${lookbehind}|${lookahead}`, 'gm');

// Filter out unmatched groups.
[...text.matchAll(regex)].map(match => match.filter(Boolean));

Try it:

console.config({ maximize: true });

function match(string) {
  const pattern = String.raw`{([^}]*(?:~\d*(?:\.\d+)?)[^}]*)}`;
  const lookahead = `${pattern}(?=(?:(?!${pattern})[^])*(?:^##.+|(?![^])))`;
  const lookbehind = `(?<=^##.+$(?:(?!${pattern})[^])*)${pattern}`;
  const regex = new RegExp(`${lookbehind}|${lookahead}`, 'gm');
  
  console.log(regex); // Just to show you how monstrous it is.
  
  return string.matchAll(regex);
}

const text = `
## Header 1

{~1.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~2.0} vitae congue erat accumsan nec. {~3.0}

{~4.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~5.0} vitae congue erat accumsan nec. {~6.0}

{~7.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~8.0} vitae congue erat accumsan nec. {~9.0}

## Header 2

{~10.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~11.0} vitae congue erat accumsan nec. {~12.0}

{~113.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~14.0} vitae congue erat accumsan nec. {~15.0}

{~16.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~17.0} vitae congue erat accumsan nec. {~18.0}

## Header 3

{~19.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~20.0} vitae congue erat accumsan nec. {~21.0}

{~22.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~23.0} vitae congue erat accumsan nec. {~24.0}

{~25.0} Lorem ipsum dolor sit amet. Sed congue diam turpis, {~26.0} vitae congue erat accumsan nec. {~27.0}
`.trim();

console.log([...match(text)].map(match => match.filter(Boolean)));
<script src="https://gh-canon.github.io/stack-snippet-console/console.min.js"></script>

Upvotes: 0

Thomas Frank

Reputation: 1440

This is my variant, perhaps less regex-heavy than the others, but it works:

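function getNumbers(str) {
  // Prepend a newline so a header on the very first line is split off too.
  return `\n${str}`.split('\n## ')
    // Collect every {~x.x} marker in each chunk.
    .map(x => [...x.matchAll(/\{~(\d|\.)*\}/g)].map(x => x[0]))
    // Keep the first and last marker; drop the undefined left by marker-less chunks.
    .map(x => [x[0], x.slice(-1)]).flat(2).filter(x => x)
    // Strip the braces and the tilde, then convert to a number.
    .map(x => +x.replace(/[\{\}~]/g, ''));
}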
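A quick usage check, as a sketch: the function is repeated so the snippet runs on its own, and the `sample` input is a shortened, made-up version of the question's string.

```javascript
function getNumbers(str) {
  return `\n${str}`.split('\n## ')
    .map(x => [...x.matchAll(/\{~(\d|\.)*\}/g)].map(x => x[0]))
    .map(x => [x[0], x.slice(-1)]).flat(2).filter(x => x)
    .map(x => +x.replace(/[\{\}~]/g, ''));
}

const sample = [
  '## A', '', '{~1.0} x {~2.0} y {~3.0}', '',
  '## B', '', '{~4.0} z {~5.0}',
].join('\n');

const result = getNumbers(sample);
console.log(result); // [ 1, 3, 4, 5 ]

// A section with a single marker reports it as both first and last.
const single = getNumbers('## C\n\n{~6.0} only');
console.log(single); // [ 6, 6 ]
```

Note that this returns the marker values as numbers rather than match objects, so the positions in the string are not available. If a single-marker section should be reported only once, the result would need an extra de-duplication step per section.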

Upvotes: 1
