Multiple matches with regular expression not being returned

Question

I am using TinyMCE and it is converting all my attribute single quotes to double quotes on cleanup.

This is what I am putting into the editor.


Affiliate Accounts

And this is what the editor produces after saving it:


Accounts

There doesn't seem to be a way to override this setting in TinyMCE, so I am turning to a regex in PHP when saving the data to the database. This is what I have so far, but it doesn't seem to capture all the double quotes.

$content = preg_replace_callback('/<(.*)(\")(.*)(\")(.*)>/miU', function($matches) {
  return "<" . $matches[1] . "'" . html_entity_decode($matches[3]) . "'" . $matches[5] . ">";
}, $content);

It is replacing the quotes around the JSON-encoded string, but not those in colspan="6".

Thanks in advance for the help.

Alex Sveshnikov · Accepted Answer

As I said in the comment, parsing HTML with regex is not a good idea; it is better to use a dedicated library such as PHP Simple HTML DOM Parser. However, it is possible to construct a regex that works on well-formed HTML.

Our goal is to find all double-quoted strings inside a tag. First, let's ignore the requirement that the double-quoted string must be inside a tag. Then we can use this:

$content = preg_replace_callback('/"(.*?)"/',
  function($matches) {
    // Swap the surrounding double quotes for single quotes and decode entities such as &quot;
    return "'" . html_entity_decode($matches[1]) . "'";
  },
  $content);
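
To see why the "inside a tag" requirement matters, here is a minimal sketch that runs this first pattern on a made-up input. The sample string and the `data-info` attribute are assumptions for illustration only, not the asker's actual markup.

```php
<?php
// Hypothetical sample: one tag with two attributes, plus quoted text outside any tag.
$content = '<td colspan="6" data-info="{&quot;id&quot;:1}">Say "hi"</td>';

// Same pattern as above: every double-quoted string, wherever it appears.
$result = preg_replace_callback('/"(.*?)"/',
  function($matches) {
    return "'" . html_entity_decode($matches[1]) . "'";
  },
  $content);

echo $result, "\n";
// Prints: <td colspan='6' data-info='{"id":1}'>Say 'hi'</td>
```

Both attributes are converted, but so is the quoted text between the tags, which is not what we want.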

Now we need to add the check that the double-quoted string is inside a tag. To do this, we construct a lookahead expression that checks the text between our double-quoted string and the end of the text:

There must be a tag-closing > after the string, which means some sequence of non-<, non-> characters followed by >. The corresponding regex is [^<>]*>
It must be followed by any number of complete tags, each opened by < and closed by >. The regex for a group of characters containing a single tag is [^<]*<[^>]*>. We need to repeat this group any number of times: (?:[^<]*<[^>]*>)*
There might be some non-<, non-> characters up to the end of the text: [^<>]*$

The resulting lookahead expression looks a bit terrifying, but it does the job: (?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$).
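
As a rough check of the lookahead on its own, the sketch below applies it right after each double quote in a made-up input and collects the offsets of the quotes that pass. The sample string and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample: quotes at offsets 12 and 14 are inside the tag,
// quotes at offsets 20 and 23 are in the text between tags.
$content = '<td colspan="6">Say "hi"</td>';

// The lookahead from the text, tested immediately after a double quote.
$lookahead = '(?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$)';

preg_match_all('/"' . $lookahead . '/', $content, $matches, PREG_OFFSET_CAPTURE);

// Each match is [matched text, offset]; keep only the offsets.
$offsets = array_map(function($m) { return $m[1]; }, $matches[0]);

print_r($offsets); // offsets 12 and 14 only
```

Note that both the opening and the closing quote of an attribute pass the check. That is fine here, because the full pattern consumes each pair of quotes in a single match.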

Finally, we incorporate this lookahead check into the original regex:

$content = preg_replace_callback('/"(?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$)(.*?)"/',
  function($matches) {
    // Only double-quoted strings inside a tag reach this callback
    return "'" . html_entity_decode($matches[1]) . "'";
  },
  $content);
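
A short usage sketch, assuming a made-up input and a hypothetical helper name `singleQuoteAttributes` (neither comes from the question). It only handles well-formed HTML whose attribute values contain no raw < or > characters.

```php
<?php
// Hypothetical helper wrapping the final pattern.
function singleQuoteAttributes(string $content): string {
  return preg_replace_callback('/"(?=[^<>]*>(?:[^<]*<[^>]*>)*[^<>]*$)(.*?)"/',
    function($matches) {
      return "'" . html_entity_decode($matches[1]) . "'";
    },
    $content);
}

// Hypothetical sample: two attributes in one tag, plus quoted text outside any tag.
$input = '<td colspan="6" data-info="{&quot;id&quot;:1}">Say "hi"</td>';
$output = singleQuoteAttributes($input);

echo $output, "\n";
// Prints: <td colspan='6' data-info='{"id":1}'>Say "hi"</td>
```

Both attributes in the tag are converted, while the quoted text between the tags is left alone.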

You can check it here: Regex101 demo
