Vinod Kumar

Reputation: 675

Using JSoup to remove only HTML tags and not data within '<' and '>' tags

I'm using Jsoup to convert a string that contains HTML tags to plain text. For example:

String newStr = Jsoup.parse(testStrHTML).text();

This works well, but if my Java string contains data between < and >, e.g. Hello <[email protected]>, the email address is removed too. The output I get is Hello, whereas I expect Hello <[email protected]>.

I have also tried a regular expression:

String newStr = testStrHTML.replaceAll("\\<.*?\\>", "");

But the problem remains.

Is there any way to strip the HTML tags while keeping custom data between < and >?

Upvotes: 1

Views: 613

Answers (1)

Taemyr

Reputation: 3437

Your regexp

String newStr = testStrHTML.replaceAll("\\<.*?\\>", "");

completely removes each tag. It matches the < at the start of the tag, the tag's label, any attributes, and the final >, and then replaces the whole match with an empty string.

String newStr = testStrHTML.replaceAll("<([^>]*)>", "$1");

should replace each tag with its label and any attributes. It matches roughly the same text as your regexp, but replaces the match with the text inside the brackets instead of an empty string.
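To make the difference concrete, here is a small stdlib-only sketch comparing the two replacements. The class and method names are made up for illustration, and `user@example.com` is a placeholder address, not the one from the question. Note that Java's `replaceAll` refers to a capture group in the replacement string as `$1`.

```java
class TagRegexDemo {
    // Removes every <...> span, so an angle-bracketed address is lost too.
    static String stripAll(String s) {
        return s.replaceAll("<.*?>", "");
    }

    // Keeps the text between the brackets and drops only the brackets.
    // In Java the replacement string references group 1 as "$1".
    static String unwrap(String s) {
        return s.replaceAll("<([^>]*)>", "$1");
    }

    public static void main(String[] args) {
        // Placeholder input; the address is an assumption for this demo.
        String input = "<b>Hello</b> <user@example.com>";
        System.out.println(stripAll(input)); // prints "Hello " (address removed)
        System.out.println(unwrap(input));   // prints "bHello/b user@example.com"
    }
}
```

The second output shows why unwrapping alone is not enough: the address survives, but so do the tag names.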

Note that this discards the distinction between tags and data, so it might not be a good solution. The output is also hard to read, because the contents of valid HTML tags are partially retained.
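If a regex is still wanted, one possible workaround (a different technique from the replacement above, and only a rough sketch) is to match only tokens that are shaped like HTML tags: an optional `/`, a tag name made of letters and digits, then either whitespace plus attributes or the closing `>`. An address such as `<user@example.com>` does not fit that shape because of the `@`, so it is left alone. The class name, method name, and sample address here are hypothetical.

```java
import java.util.regex.Pattern;

class TagShapedStripper {
    // Matches <name>, </name>, <name attr="...">, and <name/>.
    // Assumption: anything else between < and > is data, not markup.
    private static final Pattern TAG =
        Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // Removes tag-shaped tokens and leaves other bracketed text in place.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Placeholder input; the address is an assumption for this demo.
        String input = "<p>Hello <user@example.com></p>";
        System.out.println(stripTags(input)); // prints "Hello <user@example.com>"
    }
}
```

This is a heuristic, not an HTML parser: bracketed data that happens to look like a tag, for example `<John Smith>`, would still be removed, and comments or malformed markup are not handled.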

It might be better to stay with Jsoup and navigate the DOM.

Upvotes: 2
