JavaScript function to convert Unicode pseudo-alphabet to regular characters?

Question

I am trying to write a function that takes any string containing characters from the Unicode pseudo-alphabets and returns an equivalent string in which those characters have been replaced with the regular characters found in ASCII.

const toRegularCharacters = s => {
  // ?
};

toRegularCharacters('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ'); // "hello, world"
toRegularCharacters('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭'); // "hello, world"
toRegularCharacters('ん乇ﾚﾚo, wo尺ﾚd'); // "hello, world"

I don't want to write a look-up table myself. I have looked at various "slugify" libraries, but they only remove accents etc. Ideally the function should work in Node and the browser.

Of course, not every special character will have a regular equivalent. The solution should make a reasonable guess in these cases (e.g. "尺" -> "R"). It should work flawlessly for the pseudo-alphabets with "true transforms":

Current true transforms: circled, negative circled, Asian fullwidth, math bold, math bold Fraktur, math bold italic, math bold script, math double-struck, math monospace, math sans, math sans-serif bold, math sans-serif bold italic, math sans-serif italic, parenthesized, regional indicator symbols, squared, negative squared, and tagging text (invisible for hidden metadata tagging).

From https://qaz.wtf/u/convert.cgi

How should I go about this?

Going from a "regular" string to a pseudo-alphabet one is implemented here: https://qaz.wtf/u/convert.cgi?text=hello%2C+world

sdgfsdh · Accepted Answer

Following the suggestion from this answer, this solution uses the unicode-12.1.0 NPM package:

const unicodeNames = require('unicode-12.1.0/Names');

const overrides = Object.freeze({
  'ん': 'h',
  '乇': 'E',
  'ﾚ': 'l',
  '尺': 'r',
  // ...
});

const toRegularCharacters = xs => {
  if (typeof xs !== 'string') {
    throw new TypeError('xs must be a string');
  }

  // Spreading a string iterates by code point, so astral characters
  // such as '𝕩' are not split into surrogate halves.
  return [ ...xs ].map(x => {
    const override = overrides[x];

    if (override) {
      return override;
    }

    // Not every code point has an entry in the Names table;
    // leave such characters untouched instead of throwing.
    const name = unicodeNames.get(x.codePointAt(0));

    if (!name) {
      return x;
    }

    const names = name.split(/\s+/);

    const isCapital = names.includes('CAPITAL');

    const isLetter = isCapital || names.includes('SMALL');

    if (isLetter) {
      // e.g. "Ŧ" is named "LATIN CAPITAL LETTER T WITH STROKE":
      // the letter is the word before "WITH", however long the suffix is.
      const withIndex = names.indexOf('WITH');
      const c = withIndex > 0 ?
        names[withIndex - 1] :
        names[names.length - 1];

      return isCapital ?
        c :
        c.toLowerCase();
    }

    return x;
  }).join('');
};

console.log(
  toRegularCharacters('𝕩𝕩.𝕒𝕝𝕖𝕤𝕙𝕪.𝕩𝕩')
);

console.log(
  toRegularCharacters('🅰🅱🅲🅳-🅴🅵🅷')
);

console.log(
  toRegularCharacters('ん乇ﾚﾚo, wo尺ﾚd')
);

console.log(
  toRegularCharacters('ŦɆSŦƗNǤ')
);

The Names data table contains the required information, but not in a convenient form, so the letter has to be pulled out of each character name with some hacky string manipulation.
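
The name parsing can be tried in isolation, without the package installed. The following is a minimal sketch that works on hard-coded name strings; letterFromName is a hypothetical helper (not part of unicode-12.1.0), and unlike the function above it takes the word before "WITH" and rejects anything that is not a single letter.

```javascript
// Hypothetical helper: pull the base letter out of a Unicode character name.
// Returns null when the name does not describe a single A-Z letter.
const letterFromName = name => {
  const words = name.split(/\s+/);
  const isCapital = words.includes('CAPITAL');
  const isSmall = words.includes('SMALL');

  if (!isCapital && !isSmall) {
    return null;
  }

  // The letter is the word before "WITH" if present, otherwise the last word.
  const withIndex = words.indexOf('WITH');
  const candidate = withIndex > 0 ?
    words[withIndex - 1] :
    words[words.length - 1];

  // Reject multi-character words such as "COMMA" in "SMALL COMMA".
  if (!/^[A-Z]$/.test(candidate)) {
    return null;
  }

  return isCapital ? candidate : candidate.toLowerCase();
};

console.log(letterFromName('LATIN CAPITAL LETTER T WITH STROKE'));  // "T"
console.log(letterFromName('MATHEMATICAL DOUBLE-STRUCK SMALL X'));  // "x"
console.log(letterFromName('LATIN SMALL LETTER A WITH DOT ABOVE')); // "a"
console.log(letterFromName('SMALL COMMA'));                         // null
```

The single-letter check is a design choice of this sketch: it trades coverage (names ending in words like "AE" are skipped) for not emitting whole words in place of a character.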

A map of overrides is used for cases such as '尺'.

A better solution would extract the idn_mapping property as mentioned by @Seth.
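
As a point of comparison that needs no package, the built-in String.prototype.normalize with the 'NFKC' form already folds several of these pseudo-alphabets to ASCII, because circled, fullwidth, and mathematical letters carry compatibility decompositions. This is a sketch of a different technique from the one above, and it is not the idn_mapping extraction; foldCompatibility is a name made up for illustration. Styles without a decomposition (negative squared letters, for example) and look-alikes such as '尺' are not converted to ASCII, so the name-based approach or the overrides map would still be needed for those.

```javascript
// Compatibility normalization (NFKC) replaces characters that have a
// compatibility decomposition with their plain equivalents.
const foldCompatibility = s => s.normalize('NFKC');

console.log(foldCompatibility('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ')); // "hello, world"
console.log(foldCompatibility('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭')); // "hello, world"
console.log(foldCompatibility('ｈｅｌｌｏ'));        // "hello"

// Negative squared letters have no decomposition, so they are left as-is.
console.log(foldCompatibility('🅰') === '🅰');     // true
```

Since normalize is part of the language, this works the same way in Node and in the browser.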
