DrBloke

Reputation: 21

javascript regex paragraph not ending with full stop

I have a document containing lots of paragraphs. Some of these are subheadings, which are identifiable because they do not end with a full stop, like this:

<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>

I want to put the titles into h3 tags but leave the sentences alone, so I need to find and replace every paragraph that does not end in a full stop. I need to do this with JavaScript. I have tried the following, but each attempt fails. In each case the text is first read into a variable called body.

body = body.replace(/<p>(.*?)(?!\.)<\/p>/gi, "<h3>$1</h3>");

That just turns every paragraph into an h3, so everything comes out bold.

This would work, I think:

body = body.replace(/<p>(.*?)(?<!\.)<\/p>/gi, "<h3>$1</h3>");

but JavaScript does not recognise negative lookbehind.

Any ideas how I do this?

Upvotes: 1

Views: 1085

Answers (2)

Denys Séguret

Reputation: 382274

You could do the replacement paragraph by paragraph, which would be cleaner than running a regex over the whole HTML:

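// querySelectorAll returns a static list; the live collection from
// getElementsByTagName shrinks as paragraphs are replaced, which skips elements.
[].forEach.call(document.querySelectorAll('p'), function(p){
     if (!/[.?!]\s*$/.test(p.innerHTML)) p.outerHTML="<h3>"+p.innerHTML+"</h3>";
});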
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>You want to handle questions, right?</p>
<p>I'm sure you do!</p>
<p>This is a title containing 1.2 million</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>

This way there's no problem if your HTML evolves (will you really always have only P elements?).
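
The same end-of-paragraph test can also be tried without a DOM, for example in Node. This is only a minimal sketch, assuming the paragraph texts are already available as plain strings; `isTitle` and `toHtml` are hypothetical helper names made up for illustration:

```javascript
// Hypothetical helper: a paragraph counts as a title when its text does not
// end with '.', '?' or '!' (optionally followed by whitespace).
function isTitle(text) {
  return !/[.?!]\s*$/.test(text);
}

// Hypothetical helper: wrap titles in <h3> and everything else in <p>.
function toHtml(paragraphs) {
  return paragraphs
    .map(function (text) {
      return isTitle(text) ? "<h3>" + text + "</h3>" : "<p>" + text + "</p>";
    })
    .join("\n");
}

// Made-up sample data in the spirit of the snippet above.
var sample = [
  "This is a title",
  "This is a sentence.",
  "You want to handle questions, right?",
  "This is a title containing 1.2 million"
];

console.log(toHtml(sample));
```

Only the first and last entries come out as h3 here, since the period in "1.2" is not at the end of the text.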

Upvotes: 3

Sam

Reputation: 20486

You're overthinking it. Keep it simple!

body = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
//                          ^^^^

No need for the lookarounds: just require a non-period character right before the closing tag, after a lazy run of zero or more arbitrary characters.
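
To see the effect, here is a small sketch, assuming `body` holds one paragraph per line as in the question (the sample strings are made up for illustration). Since `.` does not match a newline without the `s` flag, each match stays within one line:

```javascript
// Sample input: one paragraph per line, as in the question.
var body = [
  "<p>This is a title</p>",
  "<p>This is a sentence.</p>",
  "<p>This is a title containing 1.2 million</p>"
].join("\n");

var result = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
console.log(result);

// Caveat: with two paragraphs on the SAME line, the lazy .*? can run
// past the first </p> and swallow both paragraphs into one match.
var sameLine = "<p>A.</p><p>B</p>".replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
console.log(sameLine);
```

The first call converts the two titles and leaves the sentence alone. The second one shows why the pattern relies on the one-paragraph-per-line layout: it produces a single h3 wrapped around both paragraphs.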

Note: I would use Denys' solution (which I +1'd) since regex isn't a good idea for HTML.


Update:

Check out this expression:

<p>((?:.(?!\.))*?)<\/p>

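This lazily repeats, zero or more times, a non-capturing group that matches one character only if that character is not followed by a period. Note that this rules out a period anywhere after the first character, not just at the end. The one gap is that the first character itself is never checked (the dot consumes it before any lookahead looks at it), but this can be fixed with a lookahead at the beginning: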

<p>((?=[^.])(?:.(?!\.))*?)<\/p>
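
A quick way to compare this with the simpler pattern is to run it over a few made-up paragraphs. This is only a sketch; the sample strings are assumptions for illustration:

```javascript
var pattern = /<p>((?=[^.])(?:.(?!\.))*?)<\/p>/gi;

// Sample input: one paragraph per line.
var body = [
  "<p>This is a title</p>",
  "<p>This is a sentence.</p>",
  "<p>This is a title containing 1.2 million</p>"
].join("\n");

var result = body.replace(pattern, "<h3>$1</h3>");
console.log(result);

// Two paragraphs on the same line: the match cannot run across "A.",
// so only the second paragraph is converted.
var sameLine = "<p>A.</p><p>B</p>".replace(pattern, "<h3>$1</h3>");
console.log(sameLine);
```

Here only the first paragraph becomes an h3. The "1.2 million" title is left as a paragraph because it contains a period, so this expression is stricter than "does not end with a full stop". In exchange it does not run across a preceding sentence on the same line.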

Upvotes: 1
