Harshit Khetan

Reputation: 141

Split markdown snippet when headers come

I have the following snippet of markdown:

# Glossary

This guide is aimed to familiarize the users with definitions to relevant DVC
concepts and terminologies which are frequently used.

## Workspace directory

Also abbreviated as workspace, it is the root directory of a project where DVC
is initialized by running `dvc init` command. Therefore, this directory will
contain a `.dvc` directory as well.

## Cache directory

DVC cache is a hidden storage which is found at `.dvc/cache`. This storage is
used to manage different versions of files which are under DVC control. For more
information on cache, please refer to the this
[guide](/doc/commands-reference/config#cache).

I want to split it so that there are three matches, which should be:

# Glossary
...
## Workspace directory
...
## Cache directory
...

I tried to match them using the regex /#{1,2}\s.+\n{2}[^(#{2}\s)]*/. My intention was to match the heading first with #{1,2}\s.+\n{2} and then stop matching when ##\s is found, but the second part does not do that. Can anyone guide me?

Upvotes: 2

Views: 803

Answers (2)

farf

Reputation: 11

I know this is an old post but the subject matter remains relevant and I hope someone with more regex knowledge than me will see this comment and provide an update.

I have been using Wiktor's match regex to find headings and the subsequent text before the next heading.

It works well unless there is an h1 (#) header anywhere in the body of the text. If present, it will be “gobbled up” and become part of the previous section, because the regex only stops when it sees two or more consecutive # characters, and a single "# " doesn't meet that criterion.

This will fail:

## header 2
some text
# header 1
some more text
## header 2b

The first match will be:

## header 2
some text
# header 1
some more text

instead of:

## header 2
some text

The regex seems to assume that there is only one h1 (#) header and that no other heading precedes it. When that holds, I have found no issues.

To be honest, this isn't a real issue in practice for me; I only discovered it while trying to understand the regex on regex101.com.
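One possible workaround, offered as a sketch rather than a vetted replacement: match whole lines and stop before any line that starts with one or more # followed by a space, whatever the heading level. The names sample and sectionRe are made up for illustration, and the sample is just the failing case from above.

```javascript
// Hypothetical sample mirroring the failing case above.
const sample = "## header 2\nsome text\n# header 1\nsome more text\n## header 2b";

// A heading line, then any following lines that do not start a new heading.
// (?!#+ ) rejects a line that begins with one or more "#" and a space.
const sectionRe = /^#+ .*(?:\n(?!#+ ).*)*/gm;

const sections = sample.match(sectionRe);
console.log(sections);
// [ "## header 2\nsome text", "# header 1\nsome more text", "## header 2b" ]
```

Unlike the original match regex, this variant does not include the newline that precedes the next heading. It also still treats any line starting with "# " as a heading, so a comment line inside a fenced code block would be split as well. The split regex from the other answer, /^(?=#+ )/m, already breaks at the h1 line too.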

Upvotes: 1

Wiktor Stribiżew

Reputation: 627327

Use split with the /^(?=#+ )/m regex (demo), or use match with the /^#+ [^#]*(?:#(?!#)[^#]*)*/gm regex (see another demo):

let contents = `# Glossary

This guide is aimed to familiarize the users with definitions to relevant DVC
concepts and terminologies which are frequently used.

## Workspace directory

Also abbreviated as workspace, it is the root directory of a project where DVC
is initialized by running \`dvc init\` command. Therefore, this directory will
contain a \`.dvc\` directory as well.

## Cache directory

DVC cache is a hidden storage which is found at \`.dvc/cache\`. This storage is
used to manage different versions of files which are under DVC control. For more
information on cache, please refer to the this
[guide](/doc/commands-reference/config#cache).`;

console.log(contents.split(/^(?=#+ )/m).filter(Boolean));
console.log(contents.match(/^#+ [^#]*(?:#(?!#)[^#]*)*/gm));

Output (both calls print the same array):

[
  "# Glossary\n\nThis guide is aimed to familiarize the users with definitions to relevant DVC\nconcepts and terminologies which are frequently used.\n\n",
  "## Workspace directory\n\nAlso abbreviated as workspace, it is the root directory of a project where DVC\nis initialized by running `dvc init` command. Therefore, this directory will\ncontain a `.dvc` directory as well.\n\n",
  "## Cache directory\n\nDVC cache is a hidden storage which is found at `.dvc/cache`. This storage is\nused to manage different versions of files which are under DVC control. For more\ninformation on cache, please refer to the this\n[guide](/doc/commands-reference/config#cache)."
]
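As for why the pattern in the question stops early: [^(#{2}\s)] is a negated character class, not a negated group, so every character inside the brackets is excluded individually. The short check below is only an illustration; cls and sample are names introduced here, and sample is a shortened stand-in for the real text.

```javascript
// The pattern from the question.
const attempt = /#{1,2}\s.+\n{2}[^(#{2}\s)]*/;

// Inside [...] each character stands for itself, so this class excludes
// "(", "#", "{", "2", "}", ")" and any whitespace character.
const cls = /[^(#{2}\s)]/;
console.log(cls.test("#")); // false
console.log(cls.test("2")); // false
console.log(cls.test(" ")); // false
console.log(cls.test("a")); // true

// Shortened stand-in for the markdown in the question.
const sample = "# Glossary\n\nThis guide is aimed\n\n## Workspace directory\n";
console.log(JSON.stringify(sample.match(attempt)[0]));
// "# Glossary\n\nThis"
```

The body match ends at the first space after "This" because whitespace is one of the excluded characters. That is why a lookahead such as (?=#+ ) or (?!#) is used in the patterns above instead of a character class.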


Upvotes: 2
