Reputation:
I have a CouchDB view map function that generates an abstract of a stored HTML document (first x
characters of text). Unfortunately I have no browser environment to convert HTML to plain text.
Currently I use this multi-stage regexp:
html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
.replace(/<script([\s\S]*?)<\/script>/gi, ' ')
.replace(/(<(?:.|\n)*?>)/gm, ' ')
.replace(/\s+/gm, ' ');
While it's a very good filter, it's obviously not a perfect one, and some leftovers occasionally slip through. Is there a better way to convert HTML to plain text without a browser environment?
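For context, here is a minimal sketch of how the chain above might be wrapped to produce the abstract (the first x characters of text). The name makeAbstract and its maxLength parameter are hypothetical, not part of CouchDB, and the result still has the leftovers described above (for example, undecoded entities).

```javascript
// Hypothetical helper: the function name and parameters are illustrative only.
// Strips markup with the regex chain from the question, then truncates.
function makeAbstract(html, maxLength) {
  var text = String(html)
    .replace(/<style([\s\S]*?)<\/style>/gi, ' ')   // drop embedded CSS
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ') // drop embedded scripts
    .replace(/(<(?:.|\n)*?>)/gm, ' ')              // drop remaining tags
    .replace(/\s+/gm, ' ')                         // collapse whitespace
    .trim();
  // slice() simply returns the whole string when it is shorter than maxLength
  return text.slice(0, maxLength);
}
```

Inside a view map function this could be called on whichever document field holds the HTML before emitting the result.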
Upvotes: 38
Views: 84288
Reputation: 867
With TextVersionJS (https://github.com/EDMdesigner/textversionjs) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.
In node.js it looks like:
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
(I copied the example from the page, you will have to npm install the module first.)
Upvotes: 12
Reputation: 3919
This simple regular expression works:
text.replace(/<[^>]*>/g, '');
It removes all tags, anchors included.
Entities such as &lt; do not contain a literal <, so they cause no issue with this regex. They are left in the output undecoded.
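Since the regex leaves entities in place, a second pass can decode them. The following is only a sketch under stated assumptions: the function names are hypothetical, only a handful of named entities are covered, and numeric references above the Basic Multilingual Plane are not handled.

```javascript
// Hypothetical helpers: names are illustrative, and the entity table is
// deliberately tiny. A full decoder needs the complete HTML entity list.
function decodeBasicEntities(text) {
  // &nbsp; is mapped to a plain space here as a simplification
  var named = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };
  // A single pass avoids double-decoding input such as "&amp;lt;"
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      return String.fromCharCode(code);
    }
    // Unknown names are left untouched
    return Object.prototype.hasOwnProperty.call(named, body) ? named[body] : match;
  });
}

function stripTagsAndDecode(html) {
  // Strip tags first so that a decoded "<" is never mistaken for markup
  return decodeBasicEntities(String(html).replace(/<[^>]*>/g, ''));
}
```

Decoding after stripping is a deliberate choice: if the order were reversed, text like "1 &lt; 2" would turn into "1 < 2" and the tag regex could then eat part of it.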
Upvotes: 48
Reputation: 1624
If you want something accurate and can use npm packages, I would use html-to-text.
From the README:
const { htmlToText } = require('html-to-text');
const html = '<h1>Hello World</h1>';
const text = htmlToText(html, {
wordwrap: 130
});
console.log(text); // Hello World
FYI, I found this on npm trends; html-to-text seemed like the best option for my use case but you can check out others here.
Upvotes: 1
Reputation: 890
An updated version of @EpokK's answer, adapted for generating the plain-text version of an HTML email:
const htmltoText = (html: string) => {
let text = html;
text = text.replace(/\n/gi, "");
text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
text = text.replace(/<\/div>/gi, "\n\n");
text = text.replace(/<\/li>/gi, "\n");
text = text.replace(/<li.*?>/gi, " * ");
text = text.replace(/<\/ul>/gi, "\n\n");
text = text.replace(/<\/p>/gi, "\n\n");
text = text.replace(/<br\s*[\/]?>/gi, "\n");
text = text.replace(/<[^>]+>/gi, "");
text = text.replace(/^\s*/gim, "");
text = text.replace(/ ,/gi, ",");
text = text.replace(/ +/gi, " ");
text = text.replace(/\n+/gi, "\n\n");
return text;
};
Upvotes: 5
Reputation: 1674
You can try it this way. Neither textContent nor innerText is supported by every browser, so fall back from one to the other:
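function htmlToText(html) {
  // Requires a DOM (a browser or a DOM implementation); let it parse the markup
  var temp = document.createElement("div");
  temp.innerHTML = html;
  return temp.textContent || temp.innerText || "";
}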
Upvotes: 7
Reputation: 11
It's pretty simple, you can also implement a "toText" prototype:
String.prototype.toText = function(){
// Use the string the method is called on, not a global "html" variable
return $("<div>").html(this.toString()).text();
};
//Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a> <br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); //Result will be "link TEXT"
Upvotes: -2
Reputation: 38092
Convert HTML to plain text the way Gmail does:
html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, ' * ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');
If you can use jQuery:
var html = jQuery('<div>').html(html).text();
Upvotes: 19