dafky2000

Reputation: 74

preg_replace multiline match but preserve new lines

I need a one-liner that strips PHP from an HTML file. The catch is that it must also preserve the newlines that the PHP code occupied.

php -r "echo preg_replace('/<\\\\?.*(\\\\?>|\$)/Us','', file_get_contents(\$argv[1]));" -- "./index.php"

This "works" but does not preserve the new lines, for example:

<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>

Resolves to:

<html>
  <head>

  </head>
  <body>
  </body>
<html>

But I need it to resolve to:

<html>
  <head>



  </head>
  <body>
  </body>
<html>

Maybe I am using a hammer to drive a screw but what I am trying to do is remove the PHP code, run the result through htmlhint and have the reported line numbers actually match the lines in the file.

If there is a better solution, I would love to hear it. The end goal is to lint files that have a mix of PHP, Javascript and HTML with their respective linters.

Upvotes: 0

Views: 297

Answers (2)

Casimir et Hippolyte

Reputation: 89557

OK, a one-liner using the tokenizer (with an ugly trick inside):

php -r 'echo array_reduce(token_get_all(file_get_contents($argv[1])),function($c,$i){return $i[0]==321?$c.$i[1]:$c.str_repeat("\n",@count_chars($i.$i[1])[10]);});'

demo

Advantage of the tokenizer: even a string like "abc <?php echo '?>'; ?> def" is correctly parsed.
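To see why, here is a small sketch that collects only the inline HTML tokens from that example string. The variable names are arbitrary; only token_get_all and T_INLINE_HTML come from PHP itself.

```php
<?php
$source = "abc <?php echo '?>'; ?> def";

$html = [];
foreach (token_get_all($source) as $token) {
    // Multi-character tokens are arrays of [id, text, line];
    // single-character tokens such as ';' are plain strings.
    if (is_array($token) && $token[0] === T_INLINE_HTML) {
        $html[] = $token[1];
    }
}

// The '?>' inside the quoted string is part of a string token,
// so it does not end the PHP block.
var_export($html);
```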

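321 is the value of the constant T_INLINE_HTML (everything that isn't between PHP tags) on the PHP version used here. Token values can change between PHP versions, so writing T_INLINE_HTML in place of 321 is more portable.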

10 is the ASCII code of the newline character (LF). By default, count_chars returns an array with the byte values as keys and the number of occurrences as values.
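A quick illustration of that lookup, with a made-up sample string. substr_count is shown only as a more explicit way to get the same count.

```php
<?php
$text = "first\n\nthird\n";

// Default mode 0: the result has an entry for every byte value 0..255.
$counts = count_chars($text);
echo $counts[10], "\n";                          // number of LF bytes in $text

// Key 10 exists even when the string contains no newline at all.
echo count_chars('no newline here')[10], "\n";

// Same count, spelled out.
echo substr_count($text, "\n"), "\n";
```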

The ugly thing is $i.$i[1]. token_get_all returns each token either as an array (token id, text, line number) or as a single-character string, so this expression concatenates either an array with a string or a string with an undefined offset. @ suppresses the resulting warnings and notices. Ugly or not, the trick avoids an explicit type test, and the number of newline characters is preserved.
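For readability, the same idea can be written out as a short script. This is a sketch rather than the one-liner itself: the function name blank_php_blocks is made up for illustration, it uses the T_INLINE_HTML constant instead of 321, and it replaces the $i.$i[1] trick with an explicit is_array() check and substr_count().

```php
<?php
// Hypothetical helper name, not part of the original one-liner.
function blank_php_blocks(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens (';', '(', ...) never contain a newline.
            continue;
        }
        [$id, $text] = $token;
        // Inline HTML is kept verbatim; any other token contributes
        // only the newlines it contains (the closing tag can swallow one).
        $out .= $id === T_INLINE_HTML
            ? $text
            : str_repeat("\n", substr_count($text, "\n"));
    }
    return $out;
}

// Usage, assuming the file path is passed as the first argument:
//   php strip.php ./index.php
if (isset($argv[1])) {
    echo blank_php_blocks(file_get_contents($argv[1]));
}
```

Indentation in front of an opening PHP tag belongs to the inline HTML and is therefore kept, so some emptied lines may still contain spaces. The line count stays the same.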


Or with DOMDocument:

php -r '$d=new DOMDocument;$d->loadHTMLFile($argv[1],8196);foreach((new DOMXPath($d))->query("//processing-instruction()")as$p)$p->parentNode->replaceChild($d->createTextNode(preg_replace("~\S+~","",$p->nodeValue)),$p);echo$d->saveHTML();'

Upvotes: 0

ctwheels

Reputation: 22817

Brief

Regex is definitely not the best answer for this problem, but since you're looking for an answer in regular expression form, here you have it!

Note: This will break if a comment or string contains <?.


Code

See this regex in use here

(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))

Results

Input

<html><?php test(); ?>
  <head>
    <?php test();

    ?>
  </head>
  <body>
  </body>
<html>

Output

<html>
  <head>



  </head>
  <body>
  </body>
<html>

Explanation

  • (?:\G(?!\A)|\h*(?=<\?)) Match either of the following options
    • \G(?!\A)
      • \G Assert position at the end of the previous match or the start of the string for the first match
      • (?!\A) Negative lookahead asserting what follows is not the start of the string (this basically makes \G only match the end of the previous match)
    • \h*(?=<\?) Match the following
      • \h* Match any number of horizontal whitespace characters (used to clean up whitespace before <?)
      • (?=<\?) Positive lookahead ensuring the following matches
        • < Match the less than character < literally
        • \? Match the question mark character ? literally
  • (.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>))) Capture the following into capture group 1
    • .* Match any character (except for line terminators) any number of times
    • (?=(?:(?!<\?)[\s\S])*?(?<=\?>)) Positive lookahead ensuring what follows matches
      • (?:(?!<\?)[\s\S])*? Match the following any number of times, but as few as possible
        • (?!<\?) Negative lookahead ensuring what follows is not matched
          • < Match the less than character < literally
          • \? Match the question mark character ? literally
        • [\s\S] Match any character
      • (?<=\?>) Positive lookbehind ensuring what precedes matches the following
        • \? Match the question mark character ? literally
        • > Match the greater than character > literally
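
A minimal sketch of applying the pattern from PHP, assuming preg_replace with an empty replacement string (the answer itself only gives the pattern). The ~ delimiter and the variable names are choices made for this example.

```php
<?php
// Pattern copied from the Code section; '~' is the delimiter.
$pattern = '~(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))~';

// The sample input from the question.
$input = "<html><?php test(); ?>\n"
       . "  <head>\n"
       . "    <?php test();\n"
       . "\n"
       . "    ?>\n"
       . "  </head>\n"
       . "  <body>\n"
       . "  </body>\n"
       . "<html>\n";

// Each match stays within one line, so the newlines are never consumed.
$output = preg_replace($pattern, '', $input);
echo $output;
```

Because . does not match line terminators here, each replacement removes at most the rest of one line, which is what keeps the line numbers aligned.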

Upvotes: 2
