Martin

Reputation: 5297

regex: put text outside <p> inside <p>

I have some broken HTML that I would like to fix with a regex.

The html might be something like this:

<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>

But there can be many more paragraphs, and other HTML elements too.

I want to turn it into:

<p>text1</p>
<p>text2</p>
<p>text3</p>
<p>text4</p>
<p>text5</p>

Is this possible with a regex? I'm using PHP, if that matters.

Upvotes: 0

Views: 359

Answers (3)

Christophe

Reputation: 348

While regexes are not the best tool for this kind of job, this code works for the example you gave (it might not be optimal):

<?php

// Sample input: one bare line of text between wrapped paragraphs.
$text = '<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>';

// Group 1: leading <p> blocks, group 3: the bare text, group 4: trailing <p> blocks.
$regex = '|(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)|i';
$replacement = '${1}<p>${3}</p>${4}';
$replacedText = preg_replace($regex, $replacement, $text);

echo $replacedText;
?>

Note that the replacement string uses groups 1, 3 and 4 to get the correct sub-matches (groups 2 and 5 are the inner repeated groups and only hold their last repetition). If you want to match other HTML tags as well, you can use this regex:

$regex = '|(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)|i';

But be aware that it can mess things up, because the closing tag is matched independently of the opening tag, so it can pair with a different element.
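To illustrate that caveat, here is a small sketch comparing the loose tag pattern with a stricter variant that uses a backreference to require the same tag name on both sides. The sample string and the variable names `$sample`, `$loose` and `$strict` are made up for this example, and the stricter pattern is only one possible way to tighten the match, not a full fix.

```php
<?php
// Hypothetical sample with mismatched tags, for illustration only.
$sample = '<b>text</i>';

// Loose pattern (as in the regex above): opening and closing names are independent.
$loose = '|<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>|i';

// Stricter variant: capture the tag name and require the same name on close via \1.
$strict = '|<([a-z0-6]+)>[a-zA-Z0-9 \r\n]+</\1>|i';

var_dump(preg_match($loose, $sample));  // int(1): the mismatched pair is accepted
var_dump(preg_match($strict, $sample)); // int(0): the mismatched pair is rejected
```

Even the stricter variant still cannot handle nested elements or attributes, so it narrows the problem rather than removing it.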

Upvotes: 1

szbalint

Reputation: 1633

No, this is generally a bad idea with regexes. Regexes don't do stateful parsing. HTML has implicit tags, and parsing it requires keeping state.

HTML generally has lots of quirks. It is hard to write an HTML parser, because you not only have to keep track of how things should be, but also account for the broken markup seen in the wild.

Regexes are the wrong tool for this job.
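As a rough sketch of what a parser-based approach could look like in PHP, the following uses the built-in DOMDocument class to wrap stray top-level text in `<p>` elements. The function name `wrapBareText` and the temporary `<div>` wrapper are assumptions made for this example. It only handles text sitting directly at the top level of the fragment, and it assumes plain ASCII input, since `loadHTML` needs extra care with other encodings.

```php
<?php
// Hypothetical helper: wraps top-level bare text of an HTML fragment in <p>.
function wrapBareText(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: a throwaway <div> gives the fragment a single root element.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);

    // Copy the child list first, because replacing nodes changes the live list.
    foreach (iterator_to_array($root->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $p = $doc->createElement('p');
            $p->appendChild($doc->createTextNode(trim($child->nodeValue)));
            $root->replaceChild($p, $child);
        }
    }

    // Serialize the children only, so the temporary <div> is left out.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = "<p>text1</p>\n<p>text2</p>\ntext3\n<p>text4</p>\n<p>text5</p>";
echo wrapBareText($html);
```

For the sample from the question this should turn `text3` into its own paragraph, although the whitespace between elements may differ from the input.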

Upvotes: 3

Knarf

Reputation: 1273

Could http://htmlpurifier.org/ help you?

Upvotes: 1
