Reputation: 15723
I am using this pattern to remove all HTML tags (Java code):
String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);
Now, I want to keep tag <a ...> (with </a>) and tag <img ...>.
I want the result to be:
text <a href=#>link</a> b pic<img src=#>
How can I do this?
I don't want to use an HTML parser,
because I need this regex pattern to filter a lot of HTML fragments,
so I want a regex-based solution.
Upvotes: 1
Views: 875
Reputation: 686
I recommend using strip_tags (a PHP function):
string strip_tags ( string $str [, string $allowable_tags ] )
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
OUTPUT
Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>
Upvotes: -1
Reputation: 838096
You could do this using a negative lookahead:
"<(?!(?:a|/a|img)\\b).*?>"
However, this has a number of problems, and I would recommend that you use an HTML parser instead if you want a robust solution.
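As a minimal sketch, here is how the lookahead pattern above could be applied to the example from the question. The class name KeepSomeTags and the method name stripUnwantedTags are made up for illustration; only the regex itself comes from this answer.

```java
import java.util.regex.Pattern;

public class KeepSomeTags {
    // Matches any tag whose name is not "a", "/a" or "img".
    // The \b stops names that merely start with "a" (such as <abbr>) from being kept.
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Hypothetical helper: removes every tag except <a ...>, </a> and <img ...>.
    public static String stripUnwantedTags(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        System.out.println(stripUnwantedTags(html));
        // prints: text <a href=#>link</a> b pic<img src=#>
    }
}
```

Note that the pattern is case-sensitive as written, so <A HREF=#> would be stripped, and a > inside an attribute value would end the match early. These are examples of the kind of problems mentioned above.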
For more information see this question:
Upvotes: 3
Reputation: 25607
Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.
If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
See also this answer.
Upvotes: 0
Reputation: 10722
Hey! Here is your answer:
You can’t parse [X]HTML with regex.
Upvotes: 0
Reputation: 2686
Check out http://sourceforge.net/projects/regexcreator/. It is a very handy GUI regex editor.
Upvotes: 0