Reputation: 74
I need a one-liner that strips PHP from an HTML file. The trick is that I also need it to preserve the newlines previously taken up by the PHP lines.
php -r "echo preg_replace('/<\\\\?.*(\\\\?>|\$)/Us','', file_get_contents(\$argv[1]));" -- "./index.php"
This "works" but does not preserve the newlines. For example, this input:
<html><?php test(); ?>
<head>
<?php test();
?>
</head>
<body>
</body>
<html>
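Resolves to (the two-line PHP block collapses into a single empty line):
<html>
<head>

</head>
<body>
</body>
<html>
But I need it to resolve to (same line count as the input, with one empty line for each line the PHP block occupied):
<html>
<head>


</head>
<body>
</body>
<html>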
Maybe I am using a hammer to drive a screw, but what I am trying to do is remove the PHP code, run the result through htmlhint, and have the reported line numbers actually match the lines in the file.
If there is a better solution, I would love to hear it. The end goal is to lint files that contain a mix of PHP, JavaScript and HTML with their respective linters.
Upvotes: 0
Views: 297
Reputation: 89557
OK, a one-liner using the tokenizer (with one ugly trick inside):
php -r 'echo array_reduce(token_get_all(file_get_contents($argv[1])),function($c,$i){return $i[0]==T_INLINE_HTML?$c.$i[1]:$c.str_repeat("\n",@count_chars($i.$i[1])[10]);});'
Advantage of the tokenizer: even a string like "abc <?php echo '?>'; ?> def" is parsed correctly.
T_INLINE_HTML is the token type for everything that isn't between PHP tags. Use the constant rather than its numeric value (321, for example), because token values can differ between PHP versions.
10 is the ASCII code of the newline character (LF). By default, count_chars returns an array with the byte values as keys and the number of occurrences as values.
The ugly thing is $i.$i[1]. token_get_all returns each token either as an array (token ID, text, line number) or, for single-character tokens such as ;, as a plain string. So $i.$i[1] concatenates either an array with a string or a string with an undefined offset, and @ suppresses the resulting warnings and notices. Ugly as it is, the trick avoids an is_array test, and the number of newline characters is preserved.
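For readability, the same idea can be written out as a small function. This is a sketch rather than the one-liner itself: the function name strip_php_keep_newlines and the script name in the usage comment are made up for illustration, and an explicit is_array test with substr_count stands in for the @count_chars trick.

```php
<?php
// Hypothetical helper (name chosen for illustration, not from the answer).
// Keeps inline HTML as-is and replaces every PHP token by the newlines it contains.
function strip_php_keep_newlines(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as ";" or "(" are plain strings
            // and cannot contain a newline, so they contribute nothing.
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_INLINE_HTML) {
            // Everything outside the PHP tags is copied unchanged.
            $out .= $text;
        } else {
            // PHP code is dropped, but its line breaks are kept.
            $out .= str_repeat("\n", substr_count($text, "\n"));
        }
    }
    return $out;
}

// Assumed usage: php strip.php ./index.php
if (isset($argv[1])) {
    echo strip_php_keep_newlines(file_get_contents($argv[1]));
}
```

Because the closing-tag token includes the newline that directly follows ?>, counting newlines in each token's text should be enough to keep the line numbers aligned.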
Or with DOMDocument (8196 is LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD):
php -r '$d=new DOMDocument;$d->loadHTMLFile($argv[1],8196);foreach((new DOMXPath($d))->query("//processing-instruction()")as$p)$p->parentNode->replaceChild($d->createTextNode(preg_replace("~\S+~","",$p->nodeValue)),$p);echo$d->saveHTML();'
Upvotes: 0
Reputation: 22817
Regex is definitely not the best answer for this problem, but since you're looking for an answer in regular expression form, here you have it!
Note: This will break if a comment or string contains <?.
(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))
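Sample input:
<html><?php test(); ?>
<head>
<?php test();
?>
</head>
<body>
</body>
<html>
Result after replacing every match with an empty string (the two lines of the PHP block are left empty):
<html>
<head>


</head>
<body>
</body>
<html>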
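As a rough sketch of how the pattern could be applied from PHP (the variable names and the inline copy of the sample are illustrative assumptions, not part of the answer), preg_replace with an empty replacement should leave one empty line for each stripped PHP line:

```php
<?php
// Sketch only: apply the pattern with preg_replace and compare line counts.
// The pattern sits in a single-quoted string so the backslashes reach PCRE unchanged.
$pattern = '~(?:\G(?!\A)|\h*(?=<\?))(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>)))~';

// Sample input from the question.
$input = "<html><?php test(); ?>\n<head>\n<?php test();\n?>\n</head>\n<body>\n</body>\n<html>\n";

// Every match (PHP code up to the end of its line) is replaced by nothing;
// the newline characters themselves are never part of a match.
$output = preg_replace($pattern, '', $input);

echo $output;

// The two newline counts should be equal if the line structure survived.
echo substr_count($input, "\n"), ' ', substr_count($output, "\n"), "\n";
```

The caveat from the note still applies: a <? inside a string or comment will confuse the pattern.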
Explanation:
(?:\G(?!\A)|\h*(?=<\?)) Match either of the following options
  \G(?!\A)
    \G Assert position at the end of the previous match or the start of the string for the first match
    (?!\A) Negative lookahead asserting what follows is not the start of the string (this basically makes \G only match the end of the previous match)
  \h*(?=<\?) Match the following
    \h* Match any number of horizontal whitespace characters (used to clean up whitespace before <?)
    (?=<\?) Positive lookahead ensuring the following matches
      < Match the less than character < literally
      \? Match the question mark character ? literally
(.*(?=(?:(?!<\?)[\s\S])*?(?<=\?>))) Capture the following into capture group 1
  .* Match any character (except for line terminators) any number of times
  (?=(?:(?!<\?)[\s\S])*?(?<=\?>)) Positive lookahead ensuring what follows matches
    (?:(?!<\?)[\s\S])*? Match the following any number of times, but as few as possible
      (?!<\?) Negative lookahead ensuring what follows is not matched
        < Match the less than character < literally
        \? Match the question mark character ? literally
      [\s\S] Match any character
    (?<=\?>) Positive lookbehind ensuring what precedes matches the following
      \? Match the question mark character ? literally
      > Match the greater than character > literally
Upvotes: 2