Sattyaki

Reputation: 406

Extract text array from HTML in JavaScript

We are doing dynamic translation of HTML documents using a translation service API (e.g., Azure). For that, we need to strip the markup and extract only the text, because the APIs have a character limit and we don't want to send useless markup characters to the API.

So, given HTML like the following:

<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back</div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>
We want the text values in an array, like:
["Hello", "There", "World", "We are back", "Members", "Name", "Age", "Satt", "10", "Matt", "20"]

What is the best approach to do this? Should I use regular expressions to parse the HTML and extract the text, or should I use some kind of recursive algorithm to collect the text?

Any help is appreciated. Thanks.

Upvotes: 0

Views: 330

Answers (2)

Jack Fleeting

Reputation: 24940

A non-regex approach to the problem, using XPath:

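// Select every text node that sits inside a <div>, in document order.
const result = document.evaluate("//div//text()", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const words = [];
for (let i = 0; i < result.snapshotLength; i++) {
  const node = result.snapshotItem(i);
  const target = node.nodeValue.trim();
  // Skip whitespace-only nodes (indentation and line breaks between tags).
  if (target.length > 0) {
    words.push(target);
  }
}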

console.log(words);

The output is your expected array.
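
For comparison with the XPath query above, the recursive option mentioned in the question can be sketched as a plain walk over `childNodes`. This is only a minimal sketch: `collectText` is a hypothetical helper name, and skipping `script` and `style` elements is an assumption about what should not be sent for translation.

```javascript
// Node-type values defined by the DOM standard (Node.ELEMENT_NODE, Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: collect trimmed, non-empty text from `node` and its descendants.
function collectText(node, out = []) {
  if (node.nodeType === TEXT_NODE) {
    const text = node.nodeValue.trim();
    if (text.length > 0) out.push(text);
    return out;
  }
  // Assumption: script and style content is code or CSS, not translatable text.
  if (node.nodeType === ELEMENT_NODE && /^(script|style)$/i.test(node.nodeName)) {
    return out;
  }
  for (const child of node.childNodes || []) {
    collectText(child, out);
  }
  return out;
}

// Only runs where a DOM is available (e.g., in a browser).
if (typeof document !== 'undefined') {
  console.log(collectText(document.body));
}
```

Because the walk visits text nodes in document order, mixed content such as `<div>We are back <span>Yeah!</span></div>` should come out as two separate entries.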

Upvotes: 1

caramba

Reputation: 22490

Update: You can select all the HTML you need and then use a regex.

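const result = [];
// Capture runs of letters, digits, spaces and "!" that sit directly between ">" and "<".
const regex = />([a-zA-Z \d\!]+)</gm;
// innerHTML of the first element in <body> that is not a <style> or <script>.
const str = document.querySelector('body *:not(style,script)').innerHTML;
let m;

while ((m = regex.exec(str)) !== null) {
  result.push(m[1]);
}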

console.log(result);
<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back<span>Yeah!</span></div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>

Follow this link for more information about the regex: https://regex101.com/r/NF7sXZ/1/
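
Since the text is meant for translation, it may be worth checking what the character class in the snippet above lets through. The sketch below runs an equivalent pattern over plain strings (no DOM needed); `extractWithRegex` is a hypothetical helper name and the sample strings are made up for illustration.

```javascript
// Same character class as in the snippet above: letters, digits, spaces and "!".
const regex = />([a-zA-Z \d!]+)</g;

// Hypothetical helper: return every captured run of text in an HTML string.
function extractWithRegex(html) {
  return Array.from(html.matchAll(regex), (m) => m[1]);
}

console.log(extractWithRegex('<div>We are back<span>Yeah!</span></div>'));
// → [ 'We are back', 'Yeah!' ]

console.log(extractWithRegex('<p>Hello, world</p><p>Café</p><p>OK</p>'));
// → [ 'OK' ]
```

The second call shows that text containing a comma or an accented letter is dropped entirely, so the character class would likely need widening for real-world content.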


As charlietfl pointed out in the comments, the first answer does not work with the following markup:

<div>We are back <span>Yeah!</span></div>

Because that markup was not part of the question, this might still be a valid solution:

const result = [];
const items = document.querySelectorAll('body div, body p, body th, body td, body span');
// You could also use the same selector as in the updated answer above.

items.forEach(item => {
  // Exactly one child node is assumed to mean the element contains only text.
  if (item.childNodes.length === 1) {
    result.push(item.innerText);
  }
});

console.log(result)
<div>
<div>
    <p>Hello</p>
    <div>
        <p>There</p>
    </div>
    <div>World</div>
</div>
<div>
    <div>We are back<span>Yeah!</span></div>
    <div>
        <p>Members</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
            </tr>
            <tr>
                <td>Satt</td>
                <td>10</td>
            </tr>
            <tr>
                <td>Matt</td>
                <td>20</td>
            </tr>
        </table>
    </div>
</div>
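
If mixed content such as `<div>We are back<span>Yeah!</span></div>` does need to be handled, one possible variation on the `childNodes.length` check above is to keep each element's direct text-node children instead. This is a rough sketch, not a drop-in replacement: `directTexts` is a hypothetical helper name, and it accepts any iterable of elements, such as the result of `querySelectorAll`.

```javascript
const TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM standard

// Hypothetical helper: for each element, keep the text nodes that are its direct children.
function directTexts(elements) {
  const result = [];
  for (const el of elements) {
    for (const child of el.childNodes) {
      if (child.nodeType !== TEXT_NODE) continue;
      const text = child.nodeValue.trim();
      // Skip whitespace-only nodes between tags.
      if (text.length > 0) result.push(text);
    }
  }
  return result;
}

// Only runs where a DOM is available (e.g., in a browser).
if (typeof document !== 'undefined') {
  console.log(directTexts(document.querySelectorAll('body *:not(style):not(script)')));
}
```

One caveat: the output follows element order, so text that comes after a nested element is listed before that nested element's own text.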

Upvotes: 3
