Reputation: 49
I am trying to create a search function for Google Chrome. Given a string, it should highlight every area of the page that contains that string. I am using Java.
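The matching step on its own could be sketched roughly as below. This assumes the page text is already available as a plain string; `MatchFinder` and `findMatches` are illustrative names, not part of any library, and the case-insensitive comparison is an assumption about how the search should behave.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative only.
class MatchFinder {
    /** Returns the start index of every case-insensitive occurrence of query in text. */
    static List<Integer> findMatches(String text, String query) {
        List<Integer> starts = new ArrayList<>();
        if (query.isEmpty()) {
            return starts; // an empty query has nothing to highlight
        }
        // Compare in place with regionMatches so the reported indices
        // always refer to positions in the original text.
        for (int i = 0; i + query.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, query, 0, query.length())) {
                starts.add(i); // overlapping matches are reported as well
            }
        }
        return starts;
    }
}
```

Each returned index, together with the query length, gives one range to highlight.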
To do this, I first need to extract all visible text. I have been analyzing HTML pages to figure out how to extract only the text.
I planned on using jsoup for the extraction, but I am not sure how to extract text from sections that look like the one below. (This is a YouTube comment with a "read more" link and a "show less" link.)
From this section, I want to extract "Not gonna lie, dat dog is ADORABLE" and either "Les mer" or "Vis mindre", depending on which of them is visible.
<div class="comment-renderer-text" tabindex="0" role="article">
<div class="comment-renderer-text-content">Not gonna lie, dat dog is ADORABLE</div>
<div class="comment-text-toggle hid">
<div class="comment-text-toggle-link read-more">
<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
<span class="yt-uix-button-content">Les mer
</span>
</button>
</div>
<div class="comment-text-toggle-link show-less hid">
<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
<span class="yt-uix-button-content">Vis mindre
</span>
</button>
</div>
</div>
</div>
Upvotes: 1
Views: 1997
Reputation: 90
I am going to assume that the given HTML has already been parsed into a jsoup Document named doc.
String text = doc.select("div.comment-renderer-text-content").first().text();
doc.select returns the Elements that match the given CSS selector. I then take the first match and get its text.
More can be read here: Jsoup Selector
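Note that first() returns null when nothing matches, so the one-liner throws a NullPointerException on pages without a comment. A minimal null-safe sketch, assuming a jsoup version that provides selectFirst (1.11.1 or later); `CommentTextExtractor` and `firstCommentText` are illustrative names:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical wrapper: class and method names are illustrative only.
class CommentTextExtractor {
    /** Returns the text of the first comment body, or "" when the page has none. */
    static String firstCommentText(String html) {
        Document doc = Jsoup.parse(html);
        // selectFirst returns null instead of throwing when nothing matches.
        Element content = doc.selectFirst("div.comment-renderer-text-content");
        return content == null ? "" : content.text();
    }
}
```

Returning an empty string is just one choice here; an Optional or an exception may suit your code better.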
Edit:
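You can use this code to get all of the text in the page body rather than selecting per class. Note that jsoup does not apply CSS, so this also includes text from elements that are hidden in the browser: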
String text = doc.body().text();
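To leave out whichever toggle is currently hidden, one option is to drop the hidden elements before reading the text. This is only a sketch: it assumes the hid class seen in your markup is how this page hides elements, and it cannot account for anything that JavaScript or a stylesheet changes at runtime. `VisibleTextExtractor` and `visibleText` are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: class and method names are illustrative only.
class VisibleTextExtractor {
    /** Returns the body text with every element carrying the "hid" class removed. */
    static String visibleText(String html) {
        Document doc = Jsoup.parse(html);
        // Work on a copy so the parsed document itself is left untouched.
        Element body = doc.body().clone();
        // Assumption: class "hid" marks elements the page does not display.
        body.select(".hid").remove();
        return body.text();
    }
}
```

For the comment in the question, the whole toggle block carries hid, so only the comment text remains. When the toggle is shown and only one of the two links carries hid, the other link's label is kept.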
Upvotes: 1