Reputation: 11
Extract the text from <strong> and <br> tags in a paragraph tag as separate strings.
I have tried to split the text with a <br> regex, but the text contains HTML code like <p>, <strong> and &nbsp;.
Example code:
Document doc = Jsoup.parse(HTML);
Elements Paragraphs = doc.getElementsByTag("p");
String options = Paragraphs.first().html();
String[] singleOption = options.split("<br>");
I want to extract the text delimited by the <strong> and <br> tags and store each piece at its own index of an array.
Upvotes: 1
Views: 215
Reputation: 11050
You can extend your split regex to <br>|</?strong>, which splits a String at <br>, <strong> and </strong> tags. To remove the other tags you can use Jsoup.clean(string, Whitelist.none()) (in jsoup 1.14.2 and later the class is named Safelist). To unescape HTML entities such as &nbsp; use Parser.unescapeEntities(string, false).
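To see what the split regex does on its own, here is a minimal JDK-only sketch (no jsoup involved). The class and method names are made up for illustration, and the input is the inner HTML of a paragraph like `<p>foo b<i>a</i>r <strong>test</strong><br>abc xyz</p>`.

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Hypothetical helper: splits an HTML fragment at <br>, <strong> and </strong>.
    static String[] splitAtTags(String html) {
        return html.split("<br>|</?strong>");
    }

    public static void main(String[] args) {
        String inner = "foo b<i>a</i>r <strong>test</strong><br>abc xyz";
        // Prints: [foo b<i>a</i>r , test, , abc xyz]
        System.out.println(Arrays.toString(splitAtTags(inner)));
    }
}
```

The raw pieces still contain the <i> tags, trailing whitespace, and an empty string where </strong> is directly followed by <br>. That is why the split needs to be followed by cleaning, trimming and filtering out empty strings.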
Combining all that using Java Streams the solution would look like this:
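import java.util.stream.Stream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

// html is the String holding the input markup
Document doc = Jsoup.parse(html);
String[] parts = doc.select("p").stream()
    .flatMap(e -> Stream.of(e.html().split("<br>|</?strong>")))  // split each paragraph
    .map(s -> Jsoup.clean(s, Whitelist.none()))                   // drop remaining tags
    .map(s -> Parser.unescapeEntities(s, false))                  // &nbsp; etc. to characters
    .map(String::trim)
    .filter(s -> !s.isEmpty())
    .toArray(String[]::new);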
This selects all paragraphs and processes each of them as described above.
For the example input:
<p>foo b<i>a</i>r <strong>test</strong><br>abc xyz</p>
<p>hi <strong>this is<br>a<br>test</strong></p>
The result will be:
[foo bar, test, abc xyz, hi, this is, a, test]
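One caveat, since the question mentions nbsp: an unescaped &nbsp; becomes the non-breaking space U+00A0, and String.trim() does not remove that character. If that matters for your input, you could normalize it before trimming. A minimal sketch using only the JDK (the class and method names are hypothetical):

```java
class NbspTrimDemo {
    // Hypothetical helper: replaces non-breaking spaces (U+00A0) with plain spaces, then trims.
    static String normalize(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String piece = "\u00A0abc xyz\u00A0";
        System.out.println(piece.trim().length());      // 9: trim() leaves both U+00A0 characters
        System.out.println(normalize(piece).length());  // 7: only "abc xyz" remains
    }
}
```

In the stream, this would mean mapping each piece through such a helper in place of String::trim, so that pieces consisting only of non-breaking spaces are dropped by the isEmpty filter.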
Upvotes: 0