Reputation: 1959
I want to remove PHP tags from a string.
content = re.sub('<\?php(.*)\?>', '', content)
This seems to work on single-line PHP tags, but when a PHP tag closes several lines later, it does not catch it. Can anybody help?
Upvotes: 1
Views: 628
Reputation: 213847
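If you just want to handle the simple cases, a simple regular expression will work fine. The *? operator in Python regular expressions gives a minimal match, and the re.DOTALL flag lets . match newlines, so a tag that spans several lines is matched as well.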
import re
_PHP_TAG = re.compile(r'<\?php.*?\?>', re.DOTALL)
def strip_php(content):
    return _PHP_TAG.sub('', content)
INPUT = """
Simple: <?php echo $a ?>.
Two on one line: <?php echo $a ?>, <?php echo $b ?>.
Multiline: <?php
if ($a) {
echo $b;
}
?>.
"""
print(strip_php(INPUT))
Output:
Simple: .
Two on one line: , .
Multiline: .
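To make the difference concrete, here is a small sketch comparing the pattern from the question with the greedy and minimal variants under re.DOTALL. The SAMPLE string and the variable names are made up for illustration; only the patterns come from the thread.

```python
import re

# Hypothetical sample: one multi-line tag, one single-line tag.
SAMPLE = "keep1 <?php a\n?> keep2 <?php b ?> keep3"

# Pattern from the question: '.' stops at newlines, so the multi-line
# tag is never matched and only the single-line tag is removed.
question_pattern = re.sub(r'<\?php(.*)\?>', '', SAMPLE)

# Greedy with DOTALL: matches from the first '<?php' to the LAST '?>',
# which also swallows the text between the two tags.
greedy_dotall = re.sub(r'<\?php.*\?>', '', SAMPLE, flags=re.DOTALL)

# Minimal match with DOTALL: each tag is removed on its own.
lazy_dotall = re.sub(r'<\?php.*?\?>', '', SAMPLE, flags=re.DOTALL)

print(repr(question_pattern))
print(repr(greedy_dotall))
print(repr(lazy_dotall))
```

Only the last variant keeps all three "keep" markers, which is why both the *? operator and re.DOTALL are needed for the simple cases.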
I hope you're not using this to sanitize input, since this is not good enough for that purpose. (It's a blacklist, not a whitelist, and blacklists are never enough.)
If you want to handle the complicated cases, such as:
<?php echo '?>' ?>
You can still do it with regular expressions, but you may wish to reconsider what tools you are using, since the regular expressions may get too complicated to read. The following regular expression will handle all of Francis Avila's test cases:
dstr = r'"(?:[^"\\]|\\.)*"'
sstr = r"'(?:[^'\\]|\\.)*'"
_PHP_TAG = re.compile(
r'''<\?[^"']*?(?:(?:%s|%s)[^"']*?)*(?:\?>|$)''' % (dstr, sstr)
)
def strip_php(content):
    return _PHP_TAG.sub('', content)
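As a quick check of the string-aware pattern, the sketch below repeats the definitions above and runs them on three inputs: a closing tag inside a quoted string, an omitted closing tag, and a short tag. The input strings and result names are illustrative assumptions, not part of the original answer.

```python
import re

# Same building blocks as above: double- and single-quoted PHP strings
# with backslash escapes.
dstr = r'"(?:[^"\\]|\\.)*"'
sstr = r"'(?:[^'\\]|\\.)*'"
_PHP_TAG = re.compile(
    r'''<\?[^"']*?(?:(?:%s|%s)[^"']*?)*(?:\?>|$)''' % (dstr, sstr)
)

def strip_php(content):
    return _PHP_TAG.sub('', content)

# A '?>' inside a quoted string does not end the tag.
quoted_close = strip_php("before <?php echo '?>' ?> after")

# A tag with no closing '?>' runs to the end of the input.
missing_close = strip_php("show this <?php echo 'NOT THIS';")

# The pattern starts at '<?', so short tags are covered too.
short_tag = strip_php("a <? echo 'x' ?> b")

print(repr(quoted_close))
print(repr(missing_close))
print(repr(short_tag))
```

Note that the surrounding whitespace is left untouched, so removing a tag between two spaces leaves both spaces behind.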
Regular expressions are almost powerful enough to solve this problem. We know this because PHP uses regular expressions to tokenize PHP source code. You can read the regular expressions PHP uses in Zend/zend_language_scanner.l. It's written for Lex, which is a common tool that creates tokenizers from regular expressions.
The reason I say "almost" is because we are actually using extended regular expressions.
Upvotes: 2
Reputation: 31651
You cannot solve this problem with regular expressions. Parsing the PHP out of a string requires a real parser that understands at least a little PHP.
However, you can solve this problem pretty easily if you have PHP available. PHP solution at the end.
Here is a demonstration of how many ways you can go wrong with your regular-expression approach:
import re
testcases = {
'easy':("""show this<?php echo 'NOT THIS'?>""",'show this'),
'multiple tags':("""<?php echo 'NOT THIS';?>show this, even though it's conditional<?php echo 'NOT THIS'?>""","show this, even though it's conditional"),
'omitted ?>':("""show this <?php echo 'NOT THIS';""", 'show this '),
'nested string':("""show this <?php echo '<?php echo "NOT THIS" ?>'?> show this""",'show this show this'),
'shorttags':("""show this <? echo 'NOT THIS SHORTTAG!'?> show this""",'show this show this'),
'echotags':("""<?php $TEST = "NOT THIS"?>show this <?=$TEST?> show this""",'show this show this'),
}
testfailstr = """
FAILED: %s
IN: %s
EXPECT: %s
GOT: %s
"""
removephp = re.compile(r'(?s)<\?php.*\?>')
for testname, (in_, expect) in testcases.items():
    got = removephp.sub('', in_)
    if expect != got:
        print(testfailstr % tuple(map(repr, (testname, in_, expect, got))))
Notice that it's extremely difficult, if not impossible, to get a regular expression to pass all test cases.
If you have PHP available you can use PHP's tokenizer to strip out the PHP. The following code should strip all PHP code out of a string without fail, and should cover all strange corner cases as well.
// one-character token, always code
define('T_ONECHAR_TOKEN', 'T_ONECHAR_TOKEN');
function strip_php($input) {
    $tokens = token_get_all($input);
    $output = '';
    $inphp = False;
    foreach ($tokens as $token) {
        if (is_string($token)) {
            $token = array(T_ONECHAR_TOKEN, $token);
        }
        list($id, $str) = $token;
        if (!$inphp) {
            if ($id===T_OPEN_TAG or $id==T_OPEN_TAG_WITH_ECHO) {
                $inphp = True;
            } else {
                $output .= $str;
            }
        } else {
            if ($id===T_CLOSE_TAG) {
                $inphp = False;
            }
        }
    }
    return $output;
}
$test = 'a <?php //NOT THIS?>show this<?php //NOT THIS';
echo strip_php($test);
Upvotes: 2
Reputation: 4448
You can do it through this:
content = re.sub('\n','', content)
content = re.sub('<\?php(.*)\?>', '', content)
Updated answer after the OP's comments:
content = re.sub('\n',' {NEWLINE} ', content)
content = re.sub('<\?php(.*)\?>', '', content)
content = re.sub(' {NEWLINE} ','\n', content)
Example in IPython:
In [81]: content
Out[81]: ' 11111 <?php 222\n\n?> \n22222\nasd <?php asd\nasdasd\n?>\n3333\n'
In [82]: content = re.sub('\n',' {NEWLINE} ', content)
In [83]: content
Out[83]: ' 11111 <?php 222 {NEWLINE} {NEWLINE} ?> {NEWLINE} 22222 {NEWLINE} asd <?php asd {NEWLINE} asdasd {NEWLINE} ?> {NEWLINE} 3333 {NEWLINE} '
In [84]: content = re.sub('<\?php(.*)\?>', '', content)
In [85]: content
Out[85]: ' 11111 {NEWLINE} 3333 {NEWLINE} '
In [88]: content = re.sub(' {NEWLINE} ','\n', content)
In [89]: content
Out[89]: ' 11111 \n3333\n'
Upvotes: -1