John

Reputation: 11

How to extract the text between <strong> and <br> tags inside a <p> tag into a string array, without any HTML code (e.g. &nbsp;) in it

I want to extract each piece of text delimited by <strong> and <br> tags inside a paragraph tag as a separate string. I tried splitting the text on <br> with a regex, but the result still contains HTML such as p and strong tags and &nbsp; entities.

Example code:

Document doc = Jsoup.parse(HTML);
Elements Paragraphs = doc.getElementsByTag("p");
String options = Paragraphs.first().html();
String[] singleOption = options.split("<br>");

I want to extract the text delimited by the strong and <br> tags and store each piece at its own index of an array.

Upvotes: 1

Views: 215

Answers (1)

Samuel Philipp

Reputation: 11050

You can extend your split regex to <br>|</?strong>, which splits a String at <br> tags and at opening or closing <strong> tags. To remove the remaining tags you can use Jsoup.clean(string, Whitelist.none()) (in newer jsoup releases Whitelist was renamed to Safelist). To unescape HTML entities such as &nbsp; use Parser.unescapeEntities(string, false).
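To see what the extended regex does on its own, here is a minimal sketch that uses only the JDK and no jsoup. The class name SplitSketch and the method splitParts are made up for illustration, and the input is assumed to be the inner HTML of a single paragraph with no other tags or entities left in it:

```java
import java.util.Arrays;

public class SplitSketch {
    // Splits raw paragraph HTML at <br> and at opening/closing <strong> tags,
    // then trims each piece and drops the empty ones.
    // Hypothetical helper for illustration; it does not strip other tags.
    static String[] splitParts(String paragraphHtml) {
        return Arrays.stream(paragraphHtml.split("<br>|</?strong>"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String inner = "hi <strong>this is<br>a<br>test</strong>";
        System.out.println(Arrays.toString(splitParts(inner)));
        // prints [hi, this is, a, test]
    }
}
```

The alternation <br>|</?strong> matches either a <br> tag or a <strong> tag with an optional leading slash, so each tag acts as a delimiter and is not part of the output.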

Combining all that using Java Streams the solution would look like this:

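import java.util.stream.Stream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

Document doc = Jsoup.parse(html);
String[] parts = doc.select("p").stream()
        .flatMap(e -> Stream.of(e.html().split("<br>|</?strong>"))) // split at <br> and <strong> tags
        .map(s -> Jsoup.clean(s, Whitelist.none()))                 // strip all remaining tags
        .map(s -> Parser.unescapeEntities(s, false))                // decode entities such as &nbsp;
        .map(String::trim)
        .filter(s -> !s.isEmpty())                                  // drop empty pieces
        .toArray(String[]::new);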

This selects all paragraphs and parses each of them.

For the example input:

<p>foo b<i>a</i>r <strong>test</strong><br>abc&nbsp;xyz</p>
<p>hi <strong>this&nbsp;is<br>a<br>test</strong></p>

The result will be:

[foo bar, test, abc xyz, hi, this is, a, test]
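One caveat: Parser.unescapeEntities decodes &nbsp; to the non-breaking space character U+00A0, not to a regular space, and String.trim() does not remove it. If plain spaces are wanted, an extra mapping step could normalize them. The following is a minimal JDK-only sketch; the class name NbspSketch and the method normalizeSpaces are hypothetical and not part of jsoup:

```java
public class NbspSketch {
    // Replaces each non-breaking space (U+00A0) with a regular space,
    // then trims leading and trailing whitespace.
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String part = "this\u00A0is\u00A0";
        System.out.println(normalizeSpaces(part).equals("this is")); // prints true
    }
}
```

In the stream above, such a helper could take the place of the .map(String::trim) step.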

Upvotes: 0
