Reputation: 675
I'm using Jsoup to convert a string that contains HTML tags to plain text. For example:
String newStr = Jsoup.parse(testStrHTML).text();
It parses it very well, but the problem is that if my Java string contains data between < and >, e.g. Hello <[email protected]>, the email address is removed as well. The output I'm getting is Hello, where I'm expecting Hello <[email protected]>.
I have also tried it with a regular expression:
String newStr = testStrHTML.replaceAll("\\<.*?\\>", "");
But the problem remains.
Is there any way to strip the HTML tags while keeping custom data between < and >?
Upvotes: 1
Views: 613
Reputation: 3437
Your regexp
String newStr = testStrHTML.replaceAll("\\<.*?\\>", "");
completely removes the tag. It matches the < at the start of the tag, the tag name, any attributes of the tag, and the final >. It then replaces the whole match with an empty string.
String newStr = testStrHTML.replaceAll("<([^>]*)>", "$1");
should replace each tag with its name and any attributes. It matches roughly the same text as your regexp, but it replaces the match with the text inside the brackets. (In Java the replacement string refers to the captured group as $1, not \1.)
Note that this loses the distinction between tags and text, so it might not be a good solution. The output is also not easy to read, because the HTML is partially retained.
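To make the difference between the two patterns concrete, here is a minimal sketch that runs both on one sample string. The class name, method names and the address user@example.com are made up for illustration, and the capture-group variant is written with $1, which is how Java's replaceAll refers to a group in the replacement.

```java
class TagRegexDemo {
    // Pattern from the question: deletes everything from '<' up to the next '>'.
    static String stripAll(String s) {
        return s.replaceAll("<.*?>", "");
    }

    // Capture-group variant: drops only the brackets and keeps what was inside them.
    static String keepInner(String s) {
        return s.replaceAll("<([^>]*)>", "$1");
    }

    public static void main(String[] args) {
        // Hypothetical input: one real tag pair plus a bracketed email address.
        String sample = "<b>Hello</b> <user@example.com>";
        System.out.println(stripAll(sample));  // prints "Hello " (address is gone)
        System.out.println(keepInner(sample)); // prints "bHello/b user@example.com"
    }
}
```

The second line of output shows both effects mentioned above: the address survives, but so do the tag names b and /b.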
It might be better to stay with Jsoup and navigate the DOM.
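One way to stay with Jsoup, sketched here as an assumption rather than something the library does for you, is to escape the bracketed addresses before parsing so the parser sees them as text instead of tags. The helper below is hypothetical, and its idea of an "address" is deliberately crude: one @ with no whitespace or angle brackets around it.

```java
class EmailBracketEscaper {
    // Rewrites <local@domain> as &lt;local@domain&gt; and leaves everything else alone.
    // Assumption: anything of the form <x@y> without spaces is an address, not a tag.
    static String escapeBracketedEmails(String html) {
        return html.replaceAll("<([^<>\\s@]+@[^<>\\s@]+)>", "&lt;$1&gt;");
    }

    public static void main(String[] args) {
        // Hypothetical input mixing a bracketed address with a real tag.
        String input = "Hello <user@example.com>, <b>welcome</b>";
        String escaped = escapeBracketedEmails(input);
        System.out.println(escaped);
        // With Jsoup on the classpath, the escaped string can be parsed as before:
        // String text = org.jsoup.Jsoup.parse(escaped).text();
    }
}
```

Since the entities are decoded when the text is extracted, the address should come back with its brackets while the real tags are still stripped. Check this against your own input first, since real data may contain bracketed content that this pattern does not cover.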
Upvotes: 2