Extract text array from HTML in JavaScript

Question

We are dynamically translating HTML documents using a translation service API (e.g., Azure). To do that, we need to strip the markup and extract only the text, because the APIs have a character limit and we don't want to send useless markup characters to the API.

So if there is an HTML document like the one below:



    Hello
    
        There
    
    World


    We are back
    
        Members
        
            
                Name
                Age
            
            
                Satt
                10
            
            
                Matt
                20

We want the text values in an array, like:

["Hello", "There", "World", "We are back", "Members", "Name", "Age", "Satt", "10", "Matt", "20"]

What is the best approach to do this? Should I use regular expressions to parse the HTML and extract the text, or should I use some kind of recursive algorithm to collect the text?

Any help is appreciated. Thanks.

caramba · Accepted Answer

Update: You can select all needed HTML and then use a regex.

var result = [];
const regex = />([a-zA-Z \d\!]+)




    Hello
    
        There
    
    World


    We are backYeah!
    
        Members
        
            
                Name
                Age
            
            
                Satt
                10
            
            
                Matt
                20



Follow this link for more information about the regex: https://regex101.com/r/NF7sXZ/1/
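
The regex snippet above is cut off after the capture group, so here is a minimal sketch of the same idea rather than the original code. It assumes the pattern ends with `<` and uses the global flag, and that the HTML is available as a string. The function name `extractTextWithRegex` and the sample markup are made up for illustration.

```javascript
// Sketch only: the closing "<" and the "g" flag are assumptions,
// because the regex in the answer is truncated.
function extractTextWithRegex(html) {
  // Capture runs of letters, digits, spaces and "!" that sit
  // directly between a ">" and the next "<".
  const regex = />([a-zA-Z \d!]+)</g;
  const result = [];
  for (const match of html.matchAll(regex)) {
    const text = match[1].trim();
    if (text !== '') {
      result.push(text);
    }
  }
  return result;
}

// Hypothetical compact markup, including the mixed-content case.
const sampleHtml =
  '<div>Hello<span>There</span>World</div>' +
  '<div>We are back<span>Yeah!</span></div>';

const regexTexts = extractTextWithRegex(sampleHtml);
console.log(regexTexts);
```

With this character class, text that is separated from the tags by line breaks, or that contains other punctuation or non-ASCII letters, is not matched, so the pattern would need to be widened for real documents.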

As charlietfl pointed out in the comments, the first answer does not work with the following markup:
We are back Yeah!

Because that markup was not part of the question, this might still be a valid solution:


var result = [];
// The same selector as in the updated answer above would also work here.
var items = document.querySelectorAll('body div, body p, body th, body td, body span');

items.forEach(item => {
  // Keep only elements whose single child node is a text node,
  // i.e. elements that contain nothing but text.
  if (item.childNodes.length === 1 && item.firstChild.nodeType === Node.TEXT_NODE) {
    result.push(item.innerText);
  }
});

console.log(result);
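
The selector-based snippet above skips any element that mixes text with child elements. The question also asks about a recursive approach, so here is a hedged sketch of one that visits every text node instead. The function name `collectText` and the helpers `el` and `txt` are hypothetical. The helpers only build plain objects that mimic the `nodeType`, `nodeName`, `nodeValue` and `childNodes` properties of DOM nodes, so the example also runs outside a browser.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

// Recursively collect the trimmed, non-empty text of every text node.
function collectText(node, result = []) {
  if (node.nodeType === TEXT_NODE) {
    const text = node.nodeValue.trim();
    if (text !== '') {
      result.push(text);
    }
    return result;
  }
  // Script and style contents are not translatable text, so skip them.
  if (node.nodeName === 'SCRIPT' || node.nodeName === 'STYLE') {
    return result;
  }
  for (const child of node.childNodes || []) {
    collectText(child, result);
  }
  return result;
}

// Hypothetical helpers that build DOM-like plain objects for the demo.
const el = (name, ...children) => ({ nodeType: 1, nodeName: name, childNodes: children });
const txt = (value) => ({ nodeType: TEXT_NODE, nodeName: '#text', nodeValue: value, childNodes: [] });

const sampleTree = el('BODY',
  el('DIV', txt('\n    Hello\n    '), el('SPAN', txt('There')), txt('\n    World\n')),
  el('DIV', txt('We are back'), el('SPAN', txt('Yeah!')))
);

const walkedTexts = collectText(sampleTree);
console.log(walkedTexts);
```

In a browser the same function could be called as `collectText(document.body)`. Because it works on text nodes rather than elements, it handles mixed content and does not depend on a fixed list of tag names.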


