Reputation: 406
We are dynamically translating HTML documents using a translation service API (e.g., Azure). To do that, we need to strip the markup and extract only the text, because the APIs have a character limit and we don't want to spend it on useless markup characters.
So, given HTML like the following:
<div>
  <div>
    <p>Hello</p>
    <div>
      <p>There</p>
    </div>
    <div>World</div>
  </div>
  <div>
    <div>We are back</div>
    <div>
      <p>Members</p>
      <table>
        <tr>
          <th>Name</th>
          <th>Age</th>
        </tr>
        <tr>
          <td>Satt</td>
          <td>10</td>
        </tr>
        <tr>
          <td>Matt</td>
          <td>20</td>
        </tr>
      </table>
    </div>
  </div>
</div>
We want the text values in an array, like:
["Hello", "There", "World", "We are back", "Members", "Name", "Age", "Satt", "10", "Matt", "20"]
What is the best approach to do this? Should I use regular expressions to parse the HTML and extract the text, or should I use some kind of recursive algorithm to collect the text values?
Any help is appreciated. Thanks.
Upvotes: 0
Views: 330
Reputation: 24940
A non-regex approach to the problem, using XPath:
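// Select every text node nested anywhere inside a <div>, in document order
const result = document.evaluate("//div//text()", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const words = [];
for (let i = 0; i < result.snapshotLength; i++) {
  const node = result.snapshotItem(i);
  // Trim each text node and skip the whitespace-only ones between tags
  const target = node.nodeValue.trim();
  if (target.length > 0) {
    words.push(target);
  }
}
console.log(words);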
The output is your expected array.
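Since the question mentions that the translation APIs have a character limit, the extracted array may need to be split into batches before it is sent. The following is only a rough sketch: `batchByLength` and the `maxChars` value are hypothetical names and numbers chosen for illustration, not part of any translator API, and the real limit has to come from the service's documentation.

```javascript
// Hypothetical helper: group strings into batches whose combined length
// stays within maxChars (an assumed per-request limit, not a real API value).
function batchByLength(texts, maxChars) {
  const batches = [];
  let current = [];
  let currentLength = 0;
  for (const text of texts) {
    // Start a new batch when adding this string would exceed the limit.
    if (current.length > 0 && currentLength + text.length > maxChars) {
      batches.push(current);
      current = [];
      currentLength = 0;
    }
    // A single string longer than maxChars still ends up alone in its own batch.
    current.push(text);
    currentLength += text.length;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}

const sample = ["Hello", "There", "World", "We are back"];
console.log(batchByLength(sample, 12));
// [["Hello", "There"], ["World"], ["We are back"]]
```

Keeping each string whole preserves the one-to-one mapping between array entries and text nodes, which makes it easier to write the translations back into the document afterwards.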
Upvotes: 1
Reputation: 22490
Update: You can select all the HTML you need and then run a regex over it.
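const result = [];
// Capture runs of letters, digits, spaces and "!" that sit between ">" and "<"
const regex = />([a-zA-Z \d\!]+)</gm;
// innerHTML of the first element inside <body> that is not a <style> or <script>
const str = document.querySelectorAll('body *:not(style,script)')[0].innerHTML;
let m;
while ((m = regex.exec(str)) !== null) {
  result.push(m[1]);
}
console.log(result);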
<div>
  <div>
    <p>Hello</p>
    <div>
      <p>There</p>
    </div>
    <div>World</div>
  </div>
  <div>
    <div>We are back<span>Yeah!</span></div>
    <div>
      <p>Members</p>
      <table>
        <tr>
          <th>Name</th>
          <th>Age</th>
        </tr>
        <tr>
          <td>Satt</td>
          <td>10</td>
        </tr>
        <tr>
          <td>Matt</td>
          <td>20</td>
        </tr>
      </table>
    </div>
  </div>
</div>
Follow this link for more information about the regex: https://regex101.com/r/NF7sXZ/1/
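If the markup is already available as a string (for example outside the browser), the same pattern can be applied without touching the DOM. This is a minimal sketch: `extractWithRegex` and the sample string are made up for illustration. It also shows a limitation of the character class, which only covers ASCII letters, digits, spaces and "!", so text containing anything else (such as a comma) is skipped entirely.

```javascript
// Hypothetical wrapper around the regex from this answer; takes an HTML string.
function extractWithRegex(html) {
  const regex = />([a-zA-Z \d\!]+)</gm;
  const result = [];
  let m;
  while ((m = regex.exec(html)) !== null) {
    result.push(m[1]);
  }
  return result;
}

const html =
  "<div><p>Hello</p><div>We are back<span>Yeah!</span></div>" +
  "<td>10</td><p>Hello, world</p></div>";
console.log(extractWithRegex(html));
// ["Hello", "We are back", "Yeah!", "10"]  ("Hello, world" is dropped because of the comma)
```

Widening the character class fixes individual cases, but a text that itself contains "<" or ">" will still confuse the pattern, which is one reason a DOM-based approach tends to be more robust.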
As charlietfl pointed out in the comments, the first answer does not work with the following markup:
<div>We are back <span>Yeah!</span></div>
Because that markup was not part of the question, this might still be a valid solution:
const result = [];
const items = document.querySelectorAll('body div, body p, body th, body td, body span');
// you could also use the same selector as in the updated answer above
items.forEach(item => {
  // keep only elements whose single child node is a text node, i.e. elements that contain nothing but text
  if (item.childNodes.length === 1 && item.firstChild.nodeType === Node.TEXT_NODE) {
    result.push(item.innerText);
  }
});
console.log(result);
<div>
  <div>
    <p>Hello</p>
    <div>
      <p>There</p>
    </div>
    <div>World</div>
  </div>
  <div>
    <div>We are back<span>Yeah!</span></div>
    <div>
      <p>Members</p>
      <table>
        <tr>
          <th>Name</th>
          <th>Age</th>
        </tr>
        <tr>
          <td>Satt</td>
          <td>10</td>
        </tr>
        <tr>
          <td>Matt</td>
          <td>20</td>
        </tr>
      </table>
    </div>
  </div>
</div>
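Note that the `childNodes.length` check skips "We are back" in the markup above, because that div contains both a text node and a span. A recursive walk over text nodes, the other option mentioned in the question, would pick up both pieces. The following is a rough sketch rather than a definitive implementation: `collectTexts` and the small `el`/`txt` helpers are hypothetical names, and the mock tree only stands in for real DOM nodes so that the example also runs outside a browser.

```javascript
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE in browsers

// Recursively collect trimmed, non-empty text from a node and its descendants.
function collectTexts(node, out = []) {
  if (node.nodeType === TEXT_NODE) {
    const text = node.nodeValue.trim();
    if (text.length > 0) {
      out.push(text);
    }
  } else if (node.nodeType === ELEMENT_NODE) {
    const tag = node.nodeName.toUpperCase();
    // Skip elements whose content is code or styling rather than readable text.
    if (tag !== "SCRIPT" && tag !== "STYLE") {
      for (const child of node.childNodes) {
        collectTexts(child, out);
      }
    }
  }
  return out;
}

// Mock nodes for illustration only; in a browser, pass a real element instead.
const el = (name, ...children) => ({ nodeType: ELEMENT_NODE, nodeName: name, childNodes: children });
const txt = (value) => ({ nodeType: TEXT_NODE, nodeValue: value });

const tree = el("DIV",
  txt("We are back"),
  el("SPAN", txt("Yeah!")),
  txt("\n  "),
  el("SCRIPT", txt("ignored()"))
);
console.log(collectTexts(tree));
// ["We are back", "Yeah!"]
```

In a browser the same function could be called as `collectTexts(document.body)`, since DOM nodes expose the same `nodeType`, `nodeName`, `nodeValue` and `childNodes` properties.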
Upvotes: 3