Reputation: 15723

How to remove HTML tags but keep specified ones

I am using this pattern to remove all HTML tags (Java code):

String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
html = html.replaceAll("\\<.*?\\>", "");
System.out.println(html);

Now I want to keep the <a ...> tag (together with </a>) and the <img ...> tag, so that the result is:

text <a href=#>link</a> b pic<img src=#>

How can I do this?

I don't want to use an HTML parser, because I need this pattern to filter a large number of HTML fragments, so I am looking for a regex-based solution.

Upvotes: 1

Answers (5)

Huy - Logarit

Reputation: 686

I recommend you use strip_tags (a PHP function)

string strip_tags ( string $str [, string $allowable_tags ] )

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

OUTPUT

Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>

Upvotes: -1

Mark Byers

Reputation: 838096

You could do this using a negative lookahead:

"<(?!(?:a|/a|img)\\b).*?>"

Rubular
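
As a minimal sketch of how this pattern plugs into the code from the question (the class and method names are illustrative and not from the original post):

```java
final class KeepTagsSketch {
    // Matches any tag whose name is not a, /a, or img.
    // The word boundary after the name stops "a" from also protecting tags such as <abbr>.
    static final String STRIP_PATTERN = "<(?!(?:a|/a|img)\\b).*?>";

    // Hypothetical helper: removes every tag matched by STRIP_PATTERN.
    static String stripExceptLinksAndImages(String html) {
        return html.replaceAll(STRIP_PATTERN, "");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        System.out.println(stripExceptLinksAndImages(html));
        // prints: text <a href=#>link</a> b pic<img src=#>
    }
}
```

To keep further tags, add their names (and the closing form with a leading slash) to the alternation inside the lookahead.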

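However, this has a number of problems (for example, it breaks on attribute values that contain ">" and on tags that span several lines, since "." does not match line breaks by default), and I would instead recommend that you use an HTML parser if you want a robust solution.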

For more information see this question:

What HTML parsing libraries do you recommend in Java

Upvotes: 3

Daniel Cassidy

Reputation: 25607

Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.

If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
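
To make the risk concrete, here is a small sketch that runs a keep-list regex of the kind discussed in this thread against three inputs. The class name, the helper, and the sample strings are assumptions chosen for illustration; they are not from the original posts.

```java
final class RegexPitfalls {
    // Same keep-list idea as above: strip every tag except a, /a, and img.
    static final String STRIP_PATTERN = "<(?!(?:a|/a|img)\\b).*?>";

    static String strip(String html) {
        return html.replaceAll(STRIP_PATTERN, "");
    }

    public static void main(String[] args) {
        // 1. A '>' inside an attribute value ends the match too early,
        //    so part of the tag is left behind as text.
        System.out.println(strip("<b title=\"a>b\">bold</b>"));
        // prints: b">bold

        // 2. '.' does not match line breaks by default, so an opening
        //    tag that spans two lines is not removed at all.
        System.out.println(strip("<b\nclass=x>bold</b>"));

        // 3. A kept tag keeps every attribute, including a script URL.
        System.out.println(strip("<a href=\"javascript:alert(1)\">click</a>"));
        // prints: <a href="javascript:alert(1)">click</a>
    }
}
```

The third case is the kind of input that can lead to cross-site scripting if the filtered markup is later rendered in a page, because the regex only looks at tag names and never inspects attributes.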
