mcNogard

Reputation: 49

Extract all visible text from html

I am trying to create a search function for Google Chrome. Given a string, it will highlight all areas of the page that contain this string. I am using Java.

To do this, first I need to extract all visible text. I have tried to analyze html pages in order to figure out how to extract only text.

To do this, I planned on using jsoup. I am not sure how to extract text from sections that look like this (a YouTube comment with a "read more" link and a "show less" link).

From this section, I want to extract "Not gonna lie, dat dog is ADORABLE" and either "Les mer" or "Vis mindre", depending on which of them is visible.

<div class="comment-renderer-text" tabindex="0" role="article">
    <div class="comment-renderer-text-content">Not gonna lie, dat dog is ADORABLE</div>
    <div class="comment-text-toggle hid">
        <div class="comment-text-toggle-link read-more">
            <button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
                <span class="yt-uix-button-content">Les mer
                </span>
            </button>
        </div>
        <div class="comment-text-toggle-link show-less hid">
            <button class="yt-uix-button yt-uix-button-size-default yt-uix-button-link" type="button" onclick="return false;">
                <span class="yt-uix-button-content">Vis mindre
                </span>
            </button>
        </div>
    </div>
</div>

Upvotes: 1

Views: 1997

Answers (1)

Jop

Reputation: 90

I am going to assume that the HTML given has already been parsed into a jsoup Document named doc.

String text = doc.select("div.comment-renderer-text-content").first().text();

doc.select returns the Elements that match the given CSS selector. I then take the first match and get its text.

More can be read here: Jsoup Selector
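One caveat with the one-liner above: first() returns null when nothing matches, so calling text() on it directly can throw a NullPointerException. Here is a minimal sketch of a null-safe version, assuming jsoup is on the classpath. The class name CommentTextExample and the helper firstText are made up for illustration, and the HTML is a shortened copy of the snippet from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CommentTextExample {

    // Hypothetical helper: text of the first element matching the selector,
    // or an empty string when nothing matches.
    static String firstText(Document doc, String cssQuery) {
        Element match = doc.select(cssQuery).first(); // null if there is no match
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String html = "<div class=\"comment-renderer-text\">"
                + "<div class=\"comment-renderer-text-content\">Not gonna lie, dat dog is ADORABLE</div>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        System.out.println(firstText(doc, "div.comment-renderer-text-content"));
        System.out.println(firstText(doc, "div.no-such-class").isEmpty());
    }
}
```

If you need every comment on the page rather than only the first, you can loop over the Elements returned by doc.select and call text() on each one.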

Edit:

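You can use this code to get all the text in the body instead of selecting per class. Note that jsoup does not apply CSS, so the result also includes text from elements that a stylesheet hides: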

String text = doc.body().text();
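If you only want text that is likely to be visible, one option is to remove the hidden elements before reading the text. This is a rough sketch, not a definitive solution. It assumes that the hid class from the question's markup is what hides an element, and it also drops elements with the HTML hidden attribute. jsoup cannot know what a browser actually renders, so anything hidden through other CSS rules or JavaScript will be missed. The names VisibleTextExample and visibleText are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class VisibleTextExample {

    // Hypothetical helper: body text with elements assumed to be hidden removed.
    // Assumption: class "hid" or the "hidden" attribute marks an element as not visible.
    static String visibleText(Document doc) {
        Document copy = doc.clone();              // leave the caller's document untouched
        copy.select(".hid, [hidden]").remove();   // drop hidden elements and their children
        return copy.body().text();
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question.
        String html = "<div class=\"comment-renderer-text\">"
                + "<div class=\"comment-renderer-text-content\">Not gonna lie, dat dog is ADORABLE</div>"
                + "<div class=\"comment-text-toggle hid\">"
                + "<div class=\"comment-text-toggle-link read-more\"><span>Les mer</span></div>"
                + "<div class=\"comment-text-toggle-link show-less hid\"><span>Vis mindre</span></div>"
                + "</div>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        System.out.println(visibleText(doc)); // prints "Not gonna lie, dat dog is ADORABLE"
    }
}
```

In the snippet from the question the whole toggle container carries the hid class, so neither "Les mer" nor "Vis mindre" ends up in the result.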

Upvotes: 1
