Danziger

Reputation: 21161

How to use / parse HTML entities and Unicode characters in a JavaScript string

I want to use &zwnj;, &nbsp;, &deg; and &#8451; in a JavaScript string, but this doesn't work:

const str = `&zwnj; &nbsp; &deg; &#8451;`;

If I do console.log(str), I would expect to see something like this (note the &zwnj; would not be visible and the &nbsp; would just look like a regular space):

   ° ℃

I've seen this other question where the suggested solution is to change these entities to their hexadecimal equivalent, but that's not possible as this string comes from the backend with the entities already in place.

Upvotes: 1

Views: 3450

Answers (1)

Danziger

Reputation: 21161

Even if the HTML entities are already in that string, one way or another, you need to replace them with their actual character or their escape notation equivalent.

If they were not in the string already, one option would be to just look up their codes or calculate them.

Or, if you can type or copy-paste the original character from somewhere else, you can get its code with String.prototype.charCodeAt(), which returns the UTF-16 code unit at the given index as a decimal number, and then convert that number to hexadecimal with Number.prototype.toString() and its radix parameter:

'°'.charCodeAt(0); // 176
'°'.charCodeAt(0).toString(16); // "b0"

And then use the escape notation to represent them with their Unicode code. Note that the notation depends on the code: \xXX takes exactly two hexadecimal digits (codes up to FF), while \uXXXX takes exactly four:

const str = `\u200C \xA0 \xB0 \u2103`;

console.log(str);

console.log(str.split(' ').map(s => `${ s.charCodeAt(0) } = ${ s.charCodeAt(0).toString(16) }`));
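The lookup above can also be wrapped in a small helper. This is only a sketch: toEscapeNotation is a made-up name, not a built-in, and it uses String.prototype.codePointAt() instead of charCodeAt() so that characters outside the Basic Multilingual Plane come out as a single \u{...} escape rather than two surrogate halves.

```javascript
// Hypothetical helper (not part of any library): returns the escape
// notation for every character in a string.
function toEscapeNotation(str) {
  // Spreading a string iterates by code point, so surrogate pairs stay together.
  return [...str].map(char => {
    const code = char.codePointAt(0);
    const hex = code.toString(16).toUpperCase();

    // \xXX needs exactly 2 hex digits and \uXXXX exactly 4, so pad with zeros.
    if (code <= 0xFF) return `\\x${ hex.padStart(2, '0') }`;
    if (code <= 0xFFFF) return `\\u${ hex.padStart(4, '0') }`;

    // Anything above FFFF only fits in the \u{...} notation.
    return `\\u{${ hex }}`;
  }).join('');
}

console.log(toEscapeNotation('\u200C\xA0\xB0\u2103'));
```

The padding matters because a code such as 9 (tab) or 3A9 would otherwise produce an invalid escape like \x9 or \u3A9.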

In your case, you need to parse that string, extract the entities and replace them with the actual character they represent.

I've made this snippet so that you can just paste characters or write HTML entities and get their Unicode codes, but it also serves as an example of how to dynamically parse those HTML entities:

const sandbox = document.getElementById('sandbox');
const input = document.getElementById('input');
const list = document.getElementById('list');

function parseInput() {
  let text = input.value;
  
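  // Match one entity at a time (named, decimal or hexadecimal). A greedy /&.+;/
  // would swallow everything between the first "&" and the last ";":
  (text.match(/&#?\w+;/g) || []).forEach(entity => {
    // Insert the HTML entity as HTML in an HTML element:
    sandbox.innerHTML = entity;
    
    // Retrieve the HTML element's innerText to get the parsed entity (the actual character):
    text = text.replace(entity, sandbox.innerText);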
  });
  
  list.innerHTML = text.split('').map(char => {
    const dec = char.charCodeAt(0);
    const hex = dec.toString(16).toUpperCase();
    // \xXX needs exactly 2 hex digits and \uXXXX exactly 4, so pad with zeros:
    const code = dec <= 0xFF ? `\\x${ hex.padStart(2, '0') }` : `\\u${ hex.padStart(4, '0') }`;
    const link = hex;
  
    return `
      <li>
        <div>${ char }</div>
        <div>${ dec }</div>
        <div>${ hex }</div>
        <div><a href="http://www.fileformat.info/info/unicode/char/${ link }">${ code }</a></div>
      </li>
    `;
  }).join('');  
}

input.value = '&zwnj;&nbsp;°℃';

input.oninput = parseInput;

parseInput();
body {
  margin: 0;
  padding: 8px;
  font-family: monospace;
}

#input {
  margin-bottom: 16px;
  border-radius: 2px;
  border: 0;
  padding: 8px;
  font-family: monospace;
  font-size: 16px;
  font-weight: bold;
  box-shadow: 0 0 32px rgba(0, 0, 0, .25);
  width: 100%;
  box-sizing: border-box;
  height: 40px;
  outline: none;
}

#sandbox {
  display: none;
}

#list {
  list-style: none;
  margin: 0; 
  padding: 0;
  border-top: 1px solid #EEE;
}

#list > li {
  display: flex;
  border-bottom: 1px solid #EEE;
}

#list > li > div {
  width: 25%;
  box-sizing: border-box;
  padding: 8px;
}

#list > li > div + div {
  border-left: 1px solid #EEE;
}
<div id="sandbox"></div>

<input type="text" id="input" />

<ul id="list"></ul>
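If the string has to be decoded somewhere without a DOM (for example in Node.js), the sandbox element trick is not available. A minimal sketch of the same idea, assuming only numeric entities and a handful of named ones are needed, could look like the code below. Both decodeEntities and NAMED_ENTITIES are hypothetical names, and the table is deliberately incomplete: the full list of named entities is defined by the HTML specification.

```javascript
// Illustrative lookup table with just a few named entities.
// A real application would need the full list from the HTML specification.
const NAMED_ENTITIES = {
  zwnj: '\u200C',
  nbsp: '\xA0',
  deg: '\xB0',
  amp: '&',
};

// Hypothetical helper: replaces numeric entities and the named entities
// listed above with the characters they represent.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);

      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }

    // Entity names are case-sensitive; unknown names are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

const str = decodeEntities('&zwnj; &nbsp; &deg; &#8451;');

console.log(str === '\u200C \xA0 \xB0 \u2103'); // true
```

Entities that are not in the table are returned unchanged, so it is easy to spot which ones still need to be added.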

Upvotes: 4
