Reputation: 21
I have a document containing lots of paragraphs. Some of these are subheadings, which are identifiable because they do not end with a full stop, like this:
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
I want to put the titles into an h3 tag but leave the sentences alone, so I need to find and replace all paragraphs not ending in a full stop. I need to do this with JavaScript. I have tried the following, but each attempt fails. In each case the text is first read into a variable called body.
body = body.replace(/<p>(.*?)(?!\.)<\/p>/gi, "<h3>$1</h3>");
That just turns every paragraph into an h3, so everything comes out bold.
This would work, I think:
body = body.replace(/<p>(.*?)(?<!\.)<\/p>/gi, "<h3>$1</h3>");
but JavaScript does not recognise negative lookbehind.
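That limitation applies to older engines: lookbehind assertions were added to the language in ECMAScript 2018. The sketch below assumes an engine with that support and one paragraph per line, and runs both attempts on a two-paragraph sample (the variable names are only illustrative). It also shows why the first attempt converts everything: the lookahead is tested at the point where the closing tag begins, so the next character is always "<" and never a full stop.

```javascript
// Sample input in the question's format: one paragraph per line.
const sample = [
  "<p>This is a title</p>",
  "<p>This is a sentence.</p>",
].join("\n");

// Attempt 1: the negative lookahead is checked where "</p>" begins, so the
// next character is always "<" and the assertion can never fail.
const withLookahead = sample.replace(/<p>(.*?)(?!\.)<\/p>/gi, "<h3>$1</h3>");

// Attempt 2: the negative lookbehind inspects the character just before
// "</p>". Assumption: the engine supports ES2018 lookbehind.
const withLookbehind = sample.replace(/<p>(.*?)(?<!\.)<\/p>/gi, "<h3>$1</h3>");

console.log(withLookahead);  // both paragraphs become h3
console.log(withLookbehind); // only the title becomes h3
```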
Any ideas how I do this?
Upvotes: 1
Views: 1085
Reputation: 382274
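You could do the replacement paragraph by paragraph, which would be cleaner than running a regex over the whole HTML:
// querySelectorAll returns a static list, so replacing elements mid-loop skips nothing
// (getElementsByTagName is live and shrinks as each p is replaced)
[].forEach.call(document.querySelectorAll('p'), function(p){
  // Promote paragraphs that do not end in ".", "?" or "!" (ignoring trailing whitespace)
  if (!/[.?!]\s*$/.test(p.innerHTML)) p.outerHTML="<h3>"+p.innerHTML+"</h3>";
});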
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>You want to handle questions, right?</p>
<p>I'm sure you do!</p>
<p>This is a title containing 1.2 million</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
This way there's no problem if your HTML evolves (will you really always have only P elements?).
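If the HTML is only available as a string, as with the body variable in the question, the same per-paragraph test can be sketched with a replacement callback instead of the DOM. This is a minimal sketch, not a general HTML parser: promoteTitles is a hypothetical helper name, and it assumes plain p tags without attributes and no line breaks inside a paragraph.

```javascript
// Hypothetical helper: promote paragraphs that do not end in ".", "?" or "!".
function promoteTitles(body) {
  return body.replace(/<p>(.*?)<\/p>/gi, function (match, inner) {
    // Leave the paragraph untouched when it ends in sentence punctuation.
    return /[.?!]\s*$/.test(inner) ? match : "<h3>" + inner + "</h3>";
  });
}

const html =
  "<p>This is a title</p><p>This is a sentence.</p><p>Is this a title?</p>";
console.log(promoteTitles(html));
// <h3>This is a title</h3><p>This is a sentence.</p><p>Is this a title?</p>
```

Because the decision is made inside the callback, each paragraph is matched on its own and then tested, which keeps the pattern itself trivial.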
Upvotes: 3
Reputation: 20486
You're overthinking it. Keep it simple!
body = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
//                          ^^^^
No need for lookarounds: after the lazy .*?, just require one non-period character immediately before the closing tag.
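A quick check of that expression on two layouts, as a sketch with illustrative variable names. With one paragraph per line it behaves as intended. When paragraphs share a line, the lazy quantifier can run past a sentence's closing tag to reach the next paragraph that lacks a final period, so the input layout matters.

```javascript
const titleRegex = /<p>(.*?[^.])<\/p>/gi;

// One paragraph per line: only the paragraph without a final period changes.
const perLine = "<p>This is a title</p>\n<p>This is a sentence.</p>";
const perLineResult = perLine.replace(titleRegex, "<h3>$1</h3>");
console.log(perLineResult);
// <h3>This is a title</h3>
// <p>This is a sentence.</p>

// Same line: the match starts at the sentence and swallows the title too.
const sameLine = "<p>This is a sentence.</p><p>This is a title</p>";
const sameLineResult = sameLine.replace(titleRegex, "<h3>$1</h3>");
console.log(sameLineResult);
// <h3>This is a sentence.</p><p>This is a title</h3>
```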
Note: I would use Denys' solution (which I +1'd) since regex isn't a good idea for HTML.
Update:
Check out this expression:
<p>((?:.(?!\.))*?)<\/p>
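This lazily repeats, zero or more times, a non-capturing group that matches one character and then asserts with a negative lookahead that the next character is not a period. Note that this rejects a paragraph with a period anywhere after its first character, not only at the end. The one gap is that the first character itself is never checked, since each lookahead only inspects what follows the character just matched, but this can be fixed with a lookahead at the beginning: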
<p>((?=[^.])(?:.(?!\.))*?)<\/p>
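To see what this final expression accepts, here is a small sketch that applies it to three paragraphs (the variable names are illustrative). A plain title is converted and a sentence is left alone, but a title that merely contains a period, such as "1.2 million", is left alone as well, because the pattern forbids a period anywhere in the paragraph.

```javascript
const noPeriodAnywhere = /<p>((?=[^.])(?:.(?!\.))*?)<\/p>/gi;

const paragraphs = [
  "<p>This is a title</p>",
  "<p>This is a sentence.</p>",
  "<p>This is a title containing 1.2 million</p>",
];

// Apply the replacement to each paragraph independently.
const converted = paragraphs.map(function (p) {
  return p.replace(noPeriodAnywhere, "<h3>$1</h3>");
});

console.log(converted.join("\n"));
// <h3>This is a title</h3>
// <p>This is a sentence.</p>
// <p>This is a title containing 1.2 million</p>
```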
Upvotes: 1