sdgfsdh

Reputation: 37065

JavaScript function to convert Unicode pseudo-alphabet to regular characters?

I am trying to write a function that takes any string containing characters from the Unicode pseudo-alphabets and returns an equivalent string in which those characters have been replaced with the regular characters found in ASCII.

const toRegularCharacters = s => {
  // ?
};

toRegularCharacters('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ'); // "hello, world"
toRegularCharacters('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭'); // "hello, world"
toRegularCharacters('ん乇レレo, wo尺レd'); // "hello, world"

I don't want to write a look-up table myself. I have looked at various "slugify" libraries, but they only remove accents etc. Ideally the function should work in Node and the browser.

Of course, not every special character will have a regular equivalent. The solution should make a reasonable guess in these cases (e.g. "尺" -> "R"). It should work flawlessly for the pseudo-alphabets with "true transforms":

Current true transforms: circled, negative circled, Asian fullwidth, math bold, math bold Fraktur, math bold italic, math bold script, math double-struck, math monospace, math sans, math sans-serif bold, math sans-serif bold italic, math sans-serif italic, parenthesized, regional indicator symbols, squared, negative squared, and tagging text (invisible for hidden metadata tagging).

How should I go about this?


Going from a "regular" string to a pseudo-alphabet one is implemented here: https://qaz.wtf/u/convert.cgi?text=hello%2C+world

Upvotes: 4

Views: 1925

Answers (2)

sdgfsdh

Reputation: 37065

Following the suggestion from this answer, this solution uses the unicode-12.1.0 NPM package:

const unicodeNames = require('unicode-12.1.0/Names');

const overrides = Object.freeze({
  'ん': 'h',
  '乇': 'E',
  'レ': 'l',
  '尺': 'r',
  // ...
});

const toRegularCharacters = xs => {
  if (typeof xs !== 'string') {
    throw new TypeError('xs must be a string');
  }

  // Spreading a string iterates by code point, so astral characters stay intact.
  return [ ...xs ].map(x => {
    const override = overrides[x];

    if (override) {
      return override;
    }

    // Some code points may have no entry in the table; pass those through.
    const name = unicodeNames.get(x.codePointAt(0));

    if (!name) {
      return x;
    }

    const names = name.split(/\s+/);

    const isCapital = names.includes('CAPITAL');

    const isLetter = isCapital || names.includes('SMALL');

    if (isLetter) {
      // e.g. "Ŧ" is named "LATIN CAPITAL LETTER T WITH STROKE".
      // The letter is the word just before "WITH", however many words follow it.
      const withIndex = names.indexOf('WITH');
      const c = withIndex > 0 ?
        names[withIndex - 1] :
        names[names.length - 1];

      return isCapital ?
        c :
        c.toLowerCase();
    }

    return x;
  }).join('');
};

console.log(
  toRegularCharacters('𝕩𝕩.𝕒𝕝𝕖𝕤𝕙𝕪.𝕩𝕩')
);

console.log(
  toRegularCharacters('🅰🅱🅲🅳-🅴🅵🅷')
);

console.log(
  toRegularCharacters('ん乇レレo, wo尺レd')
);

console.log(
  toRegularCharacters('ŦɆSŦƗNǤ')
);

The Names data-table contains the required information, but not in the best form, so there is some hacky string manipulation to get the character out.

A map of overrides is used for cases such as '尺'.

A better solution would extract the idn_mapping property as mentioned by @Seth.
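As a possible shortcut for the "true transforms", here is a minimal sketch that relies only on the built-in String.prototype.normalize with the 'NFKC' form, which replaces characters by their Unicode compatibility equivalents. The helper name foldCompatibility is made up for this example. It needs no package and should behave the same in Node and the browser, but it only covers characters that have a compatibility decomposition; the overrides map and the Names lookup above would still be needed for the rest.

```javascript
// Hypothetical helper (name invented for illustration).
// NFKC maps compatibility characters such as circled, fullwidth and
// mathematical letters to their plain equivalents.
const foldCompatibility = s => {
  if (typeof s !== 'string') {
    throw new TypeError('s must be a string');
  }

  return s.normalize('NFKC');
};

console.log(foldCompatibility('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ')); // "hello, world"
console.log(foldCompatibility('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭')); // "hello, world"

// Characters without a compatibility decomposition pass through unchanged.
console.log(foldCompatibility('ん乇レレo, wo尺レd'));
```

One way to combine the two would be to call foldCompatibility first and then pass the result to toRegularCharacters, so the Names table is only consulted for whatever normalization leaves untouched.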

Upvotes: 1

T.J. Crowder

Reputation: 1074525

You could write your code to query the Unicode database, which you can download from the Unicode consortium (or query via the character utility, but that's presumably rate-limited). The database includes things like what glyphs are "confusables" for other glyphs.

For instance, your 𝓱 from 𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭 is U+1D4F1, which has lots of confusables, one of which is of course the standard Latin lowercase h (U+0068). So you could go through each character in the input string, look it up, and if it has a Latin a-z confusable (perhaps 0-9 as well), replace it with that.

It won't be perfect. As deceze pointed out, ん doesn't list any confusables, even if it does look vaguely like an "h" to an English reader. Neither does at least one other character in that example. So you may need to supplement with your own lookup even though you've said you don't want to (or just live with the imperfection).
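The lookup described above could be sketched roughly like this. It assumes the confusables data has already been loaded as text in the layout used by the Unicode consortium's confusables.txt (source code point, then target code points, separated by semicolons, with # starting a comment). The function names parseConfusables and replaceConfusables are invented for this example, and the sample lines are written by hand in that layout for illustration rather than copied from the real file.

```javascript
// Hypothetical helper: build a Map from a source character to its target string.
const parseConfusables = text => {
  const table = new Map();
  const fromHex = hexes =>
    String.fromCodePoint(...hexes.split(/\s+/).map(h => parseInt(h, 16)));

  for (const rawLine of text.split(/\r?\n/)) {
    // Drop the trailing comment, then skip blank lines.
    const line = rawLine.split('#')[0].trim();
    if (line === '') continue;

    const [source, target] = line.split(';').map(field => field.trim());
    if (!source || !target) continue;

    table.set(fromHex(source), fromHex(target));
  }

  return table;
};

// Hypothetical helper: replace a character only when its listed target
// is a single ASCII letter or digit; everything else is left alone.
const replaceConfusables = (s, table) =>
  [ ...s ].map(ch => {
    const target = table.get(ch);
    return target !== undefined && /^[A-Za-z0-9]$/.test(target) ? target : ch;
  }).join('');

// Hand-written sample in the assumed layout (not the real data file).
const sample = [
  '# illustrative lines only',
  '1D4F1 ;\t0068 ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL H',
  '1D4EE ;\t0065 ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL E',
  '1D4F5 ;\t006C ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL L',
  '1D4F8 ;\t006F ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL O',
].join('\n');

const table = parseConfusables(sample);
console.log(replaceConfusables('𝓱𝓮𝓵𝓵𝓸, ん', table));
```

Characters that are missing from the table, such as ん here, come back unchanged, which is where a hand-maintained supplement would plug in.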

Upvotes: 3
