cgi

Reputation: 29

Is this a secure way to convert HTML to text?

I have a response from a semi-untrusted API that is supposed to contain HTML. I want to convert it to plain text, basically stripping out all the formatting, so I can easily search it and then display (part of) it.

I have come up with this:

function convertHtmlToText(html) {
    const div = document.createElement("div");
    // assumption: because the div is not part of the document
    // - no scripts are executed
    // - no layout pass
    div.innerHTML = html; 
    // assumption: whitespace is still normalized
    // assumption: this returns the text a user would see, if the element was inserted into the DOM.
    //             Minus the stuff that would depend on stylesheets anyway.
    return div.innerText; 
}

const html = `
    Some random untrusted string that is supposed to contain html. 
    Presumably some 'rich text'. 
    A few <div> or <p>, a link or two, a bit of <strong> and some such. 
    In any case not a complete html document.
`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

I think that this is safe/secure, because scripts are not executed as long as the div used for conversion is not inserted into the document.

Question: Is this safe/secure?

Upvotes: 2

Views: 717

Answers (1)

Kaiido

Reputation: 136707

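No, this is not safe at all. Setting innerHTML on a detached element does not run <script> elements, but event handler attributes such as onerror still fire, because the browser starts loading the image as soon as the markup is parsed: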

function convertHtmlToText(html) {
    const div = document.createElement("div");
    // assumption: because the div is not part of the document
    // - no scripts are executed
    // - no layout pass
    div.innerHTML = html; 
    // assumption: whitespace is still normalized
    // assumption: this returns the text a user would see, if the element was inserted into the DOM.
    //             Minus the stuff that would depend on stylesheets anyway.
    return div.innerText; 
}

const html = `<img onerror="alert('Gotcha!')" src="">Hi`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

If all you need is the text content, prefer a DOMParser, which will not execute any script:

function convertHtmlToText(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.innerText;
}

const html = `<img onerror="alert('Gotcha!')" src="">Hi`;

const text = convertHtmlToText(html);

const p = document.createElement("p");
p.textContent = text;
document.body.append(p);

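But beware: both versions also return the text content of nodes that users normally can't see (e.g. <style> or <script>), because innerText on an element that is not being rendered returns the same value as textContent.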
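
A minimal sketch of one way to deal with that caveat, assuming it is acceptable to simply drop <script> and <style> elements before reading the text. The names htmlToVisibleText and collapseWhitespace are hypothetical helpers, not part of the code above, and the whitespace collapsing is an extra assumption: text read from a non-rendered document is not whitespace-normalized, so it is done by hand here.

```javascript
// Collapse every run of whitespace into a single space and trim both ends.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: parse the HTML in an inert document (no scripts or
// event handlers run), remove elements whose text is normally not shown,
// then return the remaining text.
function htmlToVisibleText(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Assumption: this selector is illustrative, not an exhaustive list
  // of elements that a browser would not display.
  doc.querySelectorAll("script, style").forEach((el) => el.remove());
  return collapseWhitespace(doc.body.textContent);
}
```

It can be used the same way as convertHtmlToText, for example p.textContent = htmlToVisibleText(html). Elements hidden through CSS (such as display: none) are still included, since nothing is rendered and no styles are applied.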

Upvotes: 4
