Reputation: 11

How to extract and store in a string array the text between <strong> and <br> that are inside a <p> tag, with no HTML code (e.g. &nbsp;) in it

Extract the text from the <strong> and <br> tags in a <p> tag as separate strings. I have tried to split the text with a <br> regex, but the resulting text still contains HTML code like p, strong and &nbsp;.

Example code:

Document doc = Jsoup.parse(HTML);
Elements Paragraphs = doc.getElementsByTag("p");
String options = Paragraphs.first().html();
String[] singleOption = options.split("<br>");

I want to extract the text from the <strong> and <br> tags and store each piece at its own index of an array.

Upvotes: 1

Answers (1)

Samuel Philipp

Reputation: 11050

You can extend your split regex to <br>|</?strong>, which splits a String at <br> and <strong> tags. To remove other tags you can use Jsoup.clean(string, Whitelist.none()). To unescape HTML entities such as &nbsp; use Parser.unescapeEntities(string, false).
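To see what the extended regex does on its own, here is a minimal sketch that needs only the JDK. The class name SplitDemo and the method splitOnTags are made up for illustration, and the input string is assumed to look like what Element.html() returns for a paragraph.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: splits a paragraph's inner HTML at <br>, <strong> and </strong>.
    static String[] splitOnTags(String innerHtml) {
        return innerHtml.split("<br>|</?strong>");
    }

    public static void main(String[] args) {
        String inner = "foo b<i>a</i>r <strong>test</strong><br>abc&nbsp;xyz";
        // Print each fragment in brackets so empty ones are visible.
        for (String fragment : splitOnTags(inner)) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

Note that the fragments still contain the <i> tags and the &nbsp; entity, and that an empty fragment appears where </strong> is directly followed by <br>. That is why the remaining steps (cleaning, unescaping, trimming and filtering out empty strings) are still needed.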

Combining all that using Java Streams the solution would look like this:

Document doc = Jsoup.parse(html);
String[] parts = doc.select("p").stream()
        .flatMap(e -> Stream.of(e.html().split("<br>|</?strong>")))
        .map(s -> Jsoup.clean(s, Whitelist.none()))
        .map(s -> Parser.unescapeEntities(s, false))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .toArray(String[]::new);

This selects all paragraphs, splits the inner HTML of each one, and cleans up the resulting pieces.

For the example input:

<p>foo b<i>a</i>r <strong>test</strong><br>abc&nbsp;xyz</p>
<p>hi <strong>this&nbsp;is<br>a<br>test</strong></p>

The result will be:

[foo bar, test, abc xyz, hi, this is, a, test]
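If you want to try the clean-up stages without jsoup on the classpath, the following is a rough sketch using JDK-only stand-ins. The names CleanupDemo, stripTags, unescapeNbsp and cleanFragments are made up for illustration. The regex-based stripTags is only a crude approximation of Jsoup.clean(s, Whitelist.none()), and unescapeNbsp handles just &nbsp; rather than every entity that Parser.unescapeEntities covers, so treat this as a demonstration of the pipeline shape and not as a replacement. Also note that, as far as I know, recent jsoup releases have renamed Whitelist to Safelist, so on a newer version you would likely write Safelist.none() instead.

```java
import java.util.Arrays;
import java.util.List;

public class CleanupDemo {
    // Hypothetical stand-in for Jsoup.clean(s, Whitelist.none()):
    // removes anything that looks like a tag. Not a real HTML sanitizer.
    static String stripTags(String s) {
        return s.replaceAll("<[^>]+>", "");
    }

    // Hypothetical stand-in for Parser.unescapeEntities(s, false):
    // only turns &nbsp; into the non-breaking space character U+00A0.
    static String unescapeNbsp(String s) {
        return s.replace("&nbsp;", "\u00A0");
    }

    // Same stage order as the stream in the answer: clean, unescape, trim, drop empties.
    static String[] cleanFragments(List<String> fragments) {
        return fragments.stream()
                .map(CleanupDemo::stripTags)
                .map(CleanupDemo::unescapeNbsp)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Fragments as produced by splitting the first example paragraph.
        List<String> fragments = Arrays.asList("foo b<i>a</i>r ", "test", "", "abc&nbsp;xyz");
        for (String s : cleanFragments(fragments)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

One thing to keep in mind: after unescaping, the space in "abc xyz" is a non-breaking space (U+00A0), not a regular space, and String.trim() does not remove it. If you need plain spaces, you could add a further map step that replaces '\u00A0' with ' '.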

Upvotes: 0
