Ahmad

Reputation: 9668

JavaScript split markdown first into headings and then into sentences

I want to split markdown text like the following, first into parts that each start with a heading, and then into sentences.

# Heading
some text including multiple sentences...
## another heading
some text including multiple sentences....
## ...

into:

# Heading
sent1 
-----
sent2
-----
....
----
## another heading
sent1
----
sent2
----
....
----
## ...

This is what I tried:

var HReg = new RegExp(/^(#{1,6}\s)(.*)/, 'gm');
var SentReg = new RegExp(/\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/, 'g');


var res1 = text.replace(HReg, function (m, g1, g2) {
    return g1 + g2 + "\r";
});

result = res1.replace(SentReg, function (m, g1, g2) {
    return g1 ? g1 : g2 + "\r"; // it's for ignoring abbreviations.
});

arr = result.split('\r');

But it separates some headings from their first sentence, or attaches a heading to the sentence before it.

Upvotes: 0

Views: 1096

Answers (1)

James Wilkins

Reputation: 7378

This is by no means the best option (a proper parser is recommended), but here is a regex that works well enough as a proof of concept:

var s = `# Heading
some text, including multiple sentences. some text including multiple sentences! some text including multiple sentences?
## another heading
some text including multiple sentences. some text including multiple sentences! some text including multiple sentences?
## ABC
some text including multiple sentences. some text including multiple sentences! some text including multiple sentences?
`;

var result = s.match(/(#+.*)|([^!?;.\n]+.)/g).map(v=>v.trim())

0: "# Heading"
1: "some text, including multiple sentences."
2: "some text including multiple sentences!"
3: "some text including multiple sentences?"
4: "## another heading"
5: "some text including multiple sentences."
6: "some text including multiple sentences!"
7: "some text including multiple sentences?"
8: "## ABC"
9: "some text including multiple sentences."
10: "some text including multiple sentences!"
11: "some text including multiple sentences?"

You can remove ; from the character class between [ ] if you want semicolons to stay inside a sentence instead of ending it. Of course, this does not protect you from anyone who decides not to use punctuation. ;)
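If you need each heading kept together with its own sentences, one option is to go line by line instead of matching over the whole string. The following is only a minimal sketch under some assumptions: the function name splitMarkdown and the { heading, sentences } shape are made up for illustration, headings are ATX-style (1 to 6 # characters followed by whitespace) on their own line, and a sentence ends at ., ! or ?. Abbreviations such as "e.g." will still be split, so treat it as a starting point rather than a finished solution.

```javascript
// Hypothetical helper (not from the question): groups sentences under their heading.
function splitMarkdown(text) {
  // A line counts as a heading only if it starts with 1-6 '#' followed by whitespace.
  const headingRe = /^#{1,6}\s+.*$/;
  // A sentence: a run of non-terminators, then terminators or the end of the line.
  const sentenceRe = /[^.!?]+(?:[.!?]+|$)/g;

  const sections = [];
  let current = { heading: null, sentences: [] };

  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines

    if (headingRe.test(trimmed)) {
      // Close the previous section (if it has any content) and start a new one.
      if (current.heading !== null || current.sentences.length > 0) {
        sections.push(current);
      }
      current = { heading: trimmed, sentences: [] };
    } else {
      // Split the body line into sentences and attach them to the open section.
      for (const part of trimmed.match(sentenceRe) || []) {
        const sentence = part.trim();
        if (sentence) current.sentences.push(sentence);
      }
    }
  }

  // Push the last open section.
  if (current.heading !== null || current.sentences.length > 0) {
    sections.push(current);
  }
  return sections;
}
```

Because the heading test is anchored to the start of a line, a # in the middle of a sentence (for example "C# is...") is not mistaken for a heading, and a heading can never end up glued to the previous sentence. Text that appears before the first heading lands in a section whose heading is null.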

Upvotes: 1
