Yuriy Petrovskiy

Reputation: 8178

How to remove tags from attribute of HTML tag in php?

I have a large number of posts generated by an old CMS. They are in HTML markup, or almost: it is the worst I have ever seen. It contains constructs like this:

....<IMG alt="Хит сезона - <b>Лучшие фразы...</b>" src="http://www.example.com/articles/pic.jpg" align=left>...

As you can see, strictly speaking this is not HTML, because it contains tags inside tag attributes.

I need to remove any tags from HTML attributes.

I tried parsing it with DOMDocument, but it cannot output Cyrillic characters correctly unless the html, head and body tags are present in the parsed string. And even when they are, I then have to remove them from the output.

The question is how to remove tags from attribute of HTML tag in PHP?

Is preg_replace suitable for this?

Upvotes: 4

Views: 522

Answers (1)

entropid

Reputation: 6239

You could try this:

preg_replace('#<([^ ]+)((\s+[\w]+=((["\'])[^\5]+\5|[^ ]+))+)>#e', '"<\\1" . str_replace("\\\'", "\'", strip_tags("\\2")) . ">"', $code);

It takes every HTML opening tag (<something>), matches all of its attributes (name="value", name='value', name=value), and then strips tags from them. The str_replace is necessary because, when the e modifier is used, PHP applies addslashes to every match before evaluating it.

I tested it and it seems to work fine. :)
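
Note that the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current versions the same idea would have to go through preg_replace_callback instead. Below is a minimal sketch of that variant, not a drop-in replacement for the one-liner above. It assumes PHP 7 or later, and that a quoted attribute value never contains its own quote character. The function name stripTagsInAttributes is made up for illustration. The pattern has no u modifier, so it works byte-wise and should not care whether the input is UTF-8 or a single-byte Cyrillic encoding.

```php
<?php
// Hypothetical helper: strips tags found inside quoted attribute values.
// Assumes a quoted value never contains its own quote character.
function stripTagsInAttributes(string $html): string
{
    // An opening tag whose attributes may be name="v", name='v', name=v or a bare name.
    // Quoted values are allowed to contain < and >.
    $tagPattern = '~<[a-zA-Z][^\s>]*(?:\s+[\w:-]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s"\'>]+))?)*\s*/?>~';

    // Inside one matched tag: "=", the opening quote, the value, the same closing quote.
    $valuePattern = '~(=\s*)(["\'])(.*?)\2~s';

    $result = preg_replace_callback(
        $tagPattern,
        function (array $tag) use ($valuePattern): string {
            return preg_replace_callback(
                $valuePattern,
                function (array $attr): string {
                    // Keep "=" and the quotes, strip tags from the value only.
                    return $attr[1] . $attr[2] . strip_tags($attr[3]) . $attr[2];
                },
                $tag[0]
            );
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$code = '<IMG alt="Хит сезона - <b>Лучшие фразы...</b>" src="http://www.example.com/articles/pic.jpg" align=left>';
echo stripTagsInAttributes($code), "\n";
// prints: <IMG alt="Хит сезона - Лучшие фразы..." src="http://www.example.com/articles/pic.jpg" align=left>
```

Tags outside attribute values are left alone, because the outer pattern only hands whole opening tags to the inner callback, and the inner pattern only touches quoted values.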

Upvotes: 1
