dmttk

Reputation: 81

Regular expression to remove HTML tags other than the <br/> tag from a string

I am stuck on writing a regex for strings such as <b>abc<br/></b>, xy<i>abcd<br/></i>, or <th>ab<br/></th>wvx.

I need to remove the < and > characters of tags such as <b>, </b>, <i>, </i>, <th>, and </th> using Java's replaceAll(<regex>, "") method, without removing the < and > characters of the <br/> tag.

Examples:

Input: <b>abc<br/></b> Output should be: babc<br/>/b

Input: xy<i>abcd<br/></i> Output should be: xyiabcd<br/>/i

Input: <th>ab<br/></th>wvx Output should be: thab<br/>/thwvx

....... etc.

Please help me to resolve this.

Upvotes: 0

Views: 1558

Answers (2)

Tim Biegeleisen

Reputation: 521083

You may try using String#replaceAll:

String input = "<b>abc<br/></b>";
input = input.replaceAll("</?(?!br)([^>]+)>", "$1");
System.out.println(input);

babc<br/>b

The pattern </?(?!br)([^>]+)> will match any opening or closing HTML tag whose name does not start with br. The replacement $1 then swaps each matched tag for the text captured by the group, which is just the name of the tag.

Note that parsing HTML with regex is generally not a good idea. This may work in your case if you only have single-level HTML, as in your example strings.
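
The output shown above drops the slash of the closing tag (b rather than /b), whereas the question expects babc<br/>/b. Here is a minimal sketch of two variants that keep the slash, assuming the only tag to preserve is br, written as <br/> in the input. The class and method names are hypothetical and used only for illustration.

```java
class BracketStripper {
    // Variant 1: capture everything between < and >, including a leading slash,
    // unless the tag name is br, then replace the whole tag with the captured text.
    static String viaGroup(String s) {
        return s.replaceAll("<(?!/?br\\b)([^<>]+)>", "$1");
    }

    // Variant 2: match only the brackets themselves, so the replacement can be "".
    // A '<' is skipped when it opens "<br/>", and a '>' is skipped when it closes "<br/>".
    static String viaLookarounds(String s) {
        return s.replaceAll("<(?!br/>)|(?<!<br/)>", "");
    }

    public static void main(String[] args) {
        String[] inputs = {"<b>abc<br/></b>", "xy<i>abcd<br/></i>", "<th>ab<br/></th>wvx"};
        for (String in : inputs) {
            System.out.println(viaGroup(in) + " | " + viaLookarounds(in));
        }
    }
}
```

For the three inputs in the question, both variants give babc<br/>/b, xyiabcd<br/>/i and thab<br/>/thwvx. The second variant fits the replaceAll(<regex>, "") form the question asks for, but it also removes any stray < or > in ordinary text, so it is only suitable for strings like the examples.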

Upvotes: 1

AlexZam

Reputation: 1196

</?([a-z]+)> should do. It will not match <br/>, because there the slash comes after the letters.
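
As a rough illustration, the sketch below applies this pattern in Java; the class and method names are hypothetical. Because the pattern allows only lowercase letters between the optional slash and >, it skips <br/>, but it also skips tags with digits, attributes, or uppercase names, and it would match a plain <br>.

```java
class LowercaseTagDemo {
    // Replaces each matched tag with its name; pass "" instead of "$1"
    // to replaceAll if the whole tag should be dropped.
    static String replaceTags(String s) {
        return s.replaceAll("</?([a-z]+)>", "$1");
    }

    public static void main(String[] args) {
        System.out.println(replaceTags("<b>abc<br/></b>")); // babc<br/>b
        System.out.println(replaceTags("<h1>x</h1>"));      // unchanged: the digit prevents a match
        System.out.println(replaceTags("a<br>b"));          // abrb: a br tag without a slash is matched
    }
}
```

As with the other answer, the closing tag comes out as b rather than /b, because the slash sits outside the capture group.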

Upvotes: 1
