thesamiroli

Reputation: 472

Getting all the text content from a HTML string in NodeJS

I need to get only the text content from an HTML string, with a space or a line break separating the text content of different elements.

For example, the HTML string might be:

<ul>
  <li>First</li>
  <li>Second</li>
</ul>

What I want:

First Second

or

First
Second

I've tried wrapping the entire string in a div and then reading its textContent using third-party libraries. However, that puts no space or line break between the text content of different elements, which is exactly what I need (i.e. I get FirstSecond, which is not what I want).

The only solution I can think of right now is to build a DOM tree, recurse through it to find the nodes that contain text, and append the text of each one to a string, separated by spaces. Is there a cleaner, neater, and simpler solution than this?

Upvotes: 7

Views: 18789

Answers (5)

Ramy Hadid

Reputation: 162

Convert HTML to Plain Text:

In your terminal, install the html-to-text npm package:

npm install html-to-text

Then in JavaScript:

const { convert } = require('html-to-text'); // Import the library

var htmlString = `
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
`;

var text = convert(htmlString, { wordwrap: 130 });
console.log(text);
// Out (by default, list items are prefixed with " * "):
//  * First
//  * Second

Hope this helps!

Upvotes: 7

vikash vik

Reputation: 686

You can try the htmlparser2 npm library. It makes this very simple:

const htmlparser2 = require('htmlparser2');

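const htmlString = ''; // your HTML string goes here
const parts = [];      // non-empty pieces of text, in document order

const parser = new htmlparser2.Parser({
    ontext(text) {
      // skip whitespace-only text (e.g. the indentation between tags)
      if (text && text.trim().length > 0) {
        parts.push(text.trim());
      }
    }
  });

parser.write(htmlString);
parser.end();

console.log(parts.join(' ')); // join with '\n' to put each piece on its own line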

Upvotes: 0

Hakan Demir

Reputation: 327

In a browser, you could read Node.textContent from the DOM. However, Node.js has no native access to the DOM, so textContent is not available and you need external packages. You can install request and cheerio with npm. cheerio, suggested by Jon Church, is perhaps the easiest web scraping tool to use (there are also more complex ones, such as jsdom). With request and cheerio in hand, you could write:

const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");

//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
    var r = new RegExp('^(?:[a-z]+:)?//', 'i');
    return r.test(url);
}

function is_local(url)
{
    var r = new RegExp('^(?:file:)?//', 'i');
    return (r.test(url) || !is_absolute(url));
}

function send_request(URL)
{
    if(is_local(URL))
    {
        //strip a leading file:// so that fs can read the path
        var url_tmp;
        if(URL.slice(0,7)==="file://")
            url_tmp = URL.slice(7);
        else
            url_tmp = URL;

        //taken from https://stackoverflow.com/a/20665078/10713877
        const $ = cheerio.load(fs.readFileSync(url_tmp));
        //Do something
        console.log($.text());
    }
    else
    {
        var options = {
            url: URL,
            headers: {
                'User-Agent': 'Your-User-Agent'
            }
        };

        request(options, function(error, response, html) {
            //no error
            if(!error && response.statusCode == 200)
            {
                console.log("Success");

                const $ = cheerio.load(html);

                //Do something
                console.log($.text());
            }
            else
            {
                console.log(`Failure: ${error}`);
            }
        });
    }
}

Let me explain the code. You pass a URL to the send_request function, which checks whether the string is a path to a local file (a plain file path, or a path starting with file://). If it is, the file is read from disk and loaded into cheerio directly; otherwise, a request is sent to the website with the request module and the response body is loaded into cheerio. The is_absolute and is_local helpers use regular expressions to make that distinction. You get the text using the text() method provided by cheerio. Under the //Do something comments, you can do whatever you want with the text. Replace 'Your-User-Agent' with your own user agent string; there are websites that will show it to you.
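To make the local-versus-remote check easier to follow, here is a minimal sketch of just that part, with no dependencies. The regular expressions are the ones from is_absolute and is_local, written as literals; the camelCase names and the toLocalPath helper are hypothetical and only used for illustration.

```javascript
// Mirrors is_absolute: matches "scheme://..." and protocol-relative "//host/...".
function isAbsolute(url) {
  return /^(?:[a-z]+:)?\/\//i.test(url);
}

// Mirrors is_local: "file://..." (or a leading "//"), or anything that is not absolute.
function isLocal(url) {
  return /^(?:file:)?\/\//i.test(url) || !isAbsolute(url);
}

// Hypothetical helper: drops a leading "file://" so the result can be given to fs.readFileSync.
function toLocalPath(url) {
  return url.startsWith("file://") ? url.slice(7) : url;
}

console.log(isLocal("/absolute/path/to/index.html"));           // true
console.log(isLocal("file:///absolute/path/to/index.html"));    // true
console.log(isLocal("https://stackoverflow.com/"));             // false
console.log(toLocalPath("file:///absolute/path/to/index.html")); // /absolute/path/to/index.html
```

Note that, as written, the second regular expression also treats a protocol-relative URL such as //example.com/page as local, which may not be what you want.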

The following calls will work:

//your local file
send_request("/absolute/path/to/your/local/index.html");
send_request("relative/path/to/your/local/index.html");
send_request("file:///absolute/path/to/your/local/index.html");
//website
send_request("https://stackoverflow.com/");

EDIT: I am on a Linux system.

Upvotes: 1

Dupinder Singh

Reputation: 7789

You can try this example; it may help you.

I used the jsdom module:

https://www.npmjs.com/package/jsdom

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent); 

This code can help I think :)

Upvotes: 1

elvira.genkel

Reputation: 1333

You can try to get rid of the HTML tags using a regex. For your example, try the following:

let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`

console.log(str)

let regex = '<\/?!?(li|ul)[^>]*>'

var re = new RegExp(regex, 'g');

str = str.replace(re, '');
console.log(str)
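If the tags are not known in advance, the same idea can be generalized. The sketch below is only an illustration under simple assumptions (stripTags is a hypothetical name): it replaces every tag with a space and then collapses whitespace, which gives the space-separated output asked for in the question.

```javascript
// Hypothetical helper: turns every tag into a space, then collapses whitespace.
// This is not a real HTML parser: it does not decode entities such as &amp;,
// and it can be confused by ">" inside attribute values, comments,
// or <script>/<style> content.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, " ") // each tag becomes a separator
    .replace(/\s+/g, " ")     // collapse runs of spaces and line breaks
    .trim();
}

const html = `<ul>
<li>First</li>
<li>Second</li>
</ul>`;

console.log(stripTags(html)); // First Second
```

Because every tag becomes a separator, inline markup is split as well (for example, a<b>b</b>c becomes "a b c"). For anything beyond simple, trusted markup, a parser-based approach is probably safer.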

Upvotes: 2
