user187676

Convert HTML to plain text in JS without browser environment

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp:

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

While it's a very good filter, it's obviously not perfect, and some leftovers occasionally slip through. Is there a better way to convert to plain text without a browser environment?

Upvotes: 38

Views: 84288

Answers (7)

gyula.nemeth

Reputation: 867

With TextVersionJS (https://github.com/EDMdesigner/textversionjs) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.

In Node.js it looks like this:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

(I copied the example from the project page; you will have to npm install the module first.)

Upvotes: 12

Gaël Barbin

Reputation: 3919

This simple regular expression works:

text.replace(/<[^>]*>/g, '');

It removes all tags.

Entities such as &lt; do not contain a literal <, so they cause no problem for this regex.
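One consequence of the point above is that entities are still encoded after the tags are gone, so an abstract would show "&lt;" rather than "<". A minimal sketch of a follow-up decoding step, assuming only a handful of named entities matter; the names stripTags, decodeEntities and ENTITIES are made up for this example, and the table is far from complete:

```javascript
// Tiny, deliberately incomplete table of named entities (an assumption:
// extend it with whatever actually occurs in your documents).
var ENTITIES = { lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00a0", amp: "&" };

// Decode named entities from the table plus decimal/hex numeric references.
// Everything is decoded in a single pass, so "&amp;lt;" becomes "&lt;"
// and is not decoded a second time.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === "#") {
      var isHex = body.charAt(1).toLowerCase() === "x";
      var code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // String.fromCharCode only handles code points up to 0xFFFF correctly.
      return String.fromCharCode(code);
    }
    // Unknown names are left untouched.
    return ENTITIES.hasOwnProperty(body) ? ENTITIES[body] : match;
  });
}

// Remove the tags first, then decode what is left.
function stripTags(html) {
  return decodeEntities(html.replace(/<[^>]*>/g, ""));
}
```

Decoding after stripping matters: if the entities were decoded first, an encoded "&lt;b&gt;" in the text would turn into "<b>" and then be removed as if it were a tag.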

Upvotes: 48

Killian Huyghe

Reputation: 1624

If you want something accurate and can use npm packages, I would use html-to-text.

From the README:

const { htmlToText } = require('html-to-text');

const html = '<h1>Hello World</h1>';
const text = htmlToText(html, {
  wordwrap: 130
});
console.log(text); // Hello World

FYI, I found this on npm trends; html-to-text seemed like the best option for my use case but you can check out others here.

Upvotes: 1

Melounek

Reputation: 890

An updated version of @EpokK's answer, for the use case of generating the plain-text version of an HTML email:

const htmltoText = (html: string) => {
  let text = html;
  text = text.replace(/\n/gi, "");
  text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
  text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
  text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
  text = text.replace(/<\/div>/gi, "\n\n");
  text = text.replace(/<\/li>/gi, "\n");
  text = text.replace(/<li.*?>/gi, "  *  ");
  text = text.replace(/<\/ul>/gi, "\n\n");
  text = text.replace(/<\/p>/gi, "\n\n");
  text = text.replace(/<br\s*[\/]?>/gi, "\n");
  text = text.replace(/<[^>]+>/gi, "");
  text = text.replace(/^\s*/gim, "");
  text = text.replace(/ ,/gi, ",");
  text = text.replace(/ +/gi, " ");
  text = text.replace(/\n+/gi, "\n\n");
  return text;
};

Upvotes: 5

Dostonbek Oripjonov

Reputation: 1674

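You can try it this way. Neither textContent nor innerText is supported by every browser, so fall back from one to the other:

function htmlToText(html) {
    // Let the DOM parse the markup inside a detached element.
    var temp = document.createElement("div");
    temp.innerHTML = html;
    return temp.textContent || temp.innerText || "";
}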

Upvotes: 7

Alberto Di Cagno

Reputation: 11

It's pretty simple, you can also implement a "toText" prototype:

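String.prototype.toText = function(){
    // Parse the string itself (not a global variable) inside a detached <div>.
    return $("<div>").html(this.toString()).text();
};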

//Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a>&nbsp;<br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); //Result will be "link TEXT" (separated by a non-breaking space)

Upvotes: -2

EpokK

Reputation: 38092

Convert HTML to plain text the way Gmail does:

html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, '  *  ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');

If you can use jQuery:

var text = jQuery('<div>').html(html).text();
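For the abstract use case in the question (the first x characters of text), the regex steps could be wrapped in one function that also truncates the result. This is only a sketch under a few assumptions: makeAbstract and maxLength are hypothetical names, tags are replaced by spaces as in the question rather than by newlines, and the comment-stripping step is an addition not present in the code above.

```javascript
// Sketch: reduce HTML to a single line of text and cut it to maxLength chars.
function makeAbstract(html, maxLength) {
  var text = html
    .replace(/<style[\s\S]*?<\/style>/gi, " ")   // drop stylesheets
    .replace(/<script[\s\S]*?<\/script>/gi, " ") // drop scripts
    .replace(/<!--[\s\S]*?-->/g, " ")            // drop comments (added step)
    .replace(/<[^>]+>/g, " ")                    // drop all remaining tags
    .replace(/\s+/g, " ")                        // collapse whitespace
    .replace(/^ | $/g, "");                      // trim both ends
  if (text.length <= maxLength) {
    return text;
  }
  // Cut at the last space within the limit so words are not split;
  // fall back to a hard cut when there is no space to cut at.
  var cut = text.lastIndexOf(" ", maxLength);
  return text.slice(0, cut > 0 ? cut : maxLength);
}
```

Inside a CouchDB map function it might be called as emit(doc._id, makeAbstract(doc.html, 200)), where doc.html is an assumed field name. Entities are left encoded by this sketch, and the usual limits of regex-based stripping still apply.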

Upvotes: 19
