Angel
Angel

Reputation: 2875

How to get the accent/diacritic of a letter in javascript?

I want to get the accent/diacritic of a letter in javascript.

For example:

I tried using .normalize("NFD") but it doesn't return the correct accent/diacritc

string = "á"
string.normalize("NFD").split("")
// ['a', '́']
string.normalize("NFD").split("").includes("´") 
// false
'́' === "´"
// false

I want NFD or any other function to give the accent/diacritic instead of the combining accent/diacritic

Upvotes: 9

Views: 1968

Answers (3)

Socko
Socko

Reputation: 745

The short answer is because COMBINING TILDE != TILDE

Here's a breakdown of each of the Unicode characters potentially involved in ñ for example:

Symbol Code CodePoint Name
ñ \u00F1 241 LATIN SMALL LETTER N WITH TILDE
n \u006E 110 LATIN SMALL LETTER N
̃ \u0303 771 COMBINING TILDE
~ \u007E 126 TILDE

In order to be able to separate out the diacritical marks from their attached characters, you can use string.normalize with "NFD" which provides the "Canonical Decomposition", breaking up a single glyph into different character combinations that result in the same symbol.

There are 112 different combining diacritical marks. I can't find a native way to convert between the combining character and it's solo counterpart. You could look for a library or write the mapping yourself for marks you want to handle like this:

const combiningMarks = {
  771: 126, // tilde
  769: 180, // acute accent
  768: 96,  // grave accent
}

Then decompose to separate chars and lookup the associated mark for each combining char like this:

const combiningMarks = {
  771: 126, // tilde
  769: 180, // acute accent
  768: 96,  // grave accent
}

const startingString = "ñáè" // "\u00F1\u00E1\u00E8"
const decomposedString = startingString.normalize("NFD") // "\u006E\u0303\u0061\u0301\u0065\u0300"
const codepoints = [...decomposedString].map(c => c.codePointAt(0)) // [110, 771, 97, 769, 101, 768]
const charsWithFullMarks = codepoints.map(c => combiningMarks[c] || c) // [110, 126, 97, 180, 101, 96]
const finalString = String.fromCodePoint(...charsWithFullMarks) // "n~a´e`"
console.log(finalString);

Upvotes: 9

Mohsen Alyafei
Mohsen Alyafei

Reputation: 5557

Your method to get the accent/diacritic of a letter is correct by using the string.normalize("NFD").split("").

The normalize("NFD") returns the correct result which in this case is the Combining Acute Accent Unicode Decimal Code &#769.

However, what you are doing is comparing the output for the letter á from the normalize("NFD") which is the Combining Acute Accent (char code 769) with the Normal Acute Accent (char code 180). Of course, these are 2 different letters.

The same applies to the letter è which has the Combining Grave Accent (char code 768); and you are comparing it to the Normal Grave Accent (char code 96) which we use and type from the keyboard; they are 2 different letters.

The standalone (normal) letters (including the Acute Accent and the Grave Accent letters) will always be separate letters even if they come before or after any other letters in a string. However, the combining forms of the letters (which have different chars codes to distinguish them) will go above or below the next or previous letter they are adjacent to. This is similar in Arabic accent letters and other languages like Hebrew.

Here are comparisons of some of the tilde letters:

enter image description here

See below the examples:

console.log("á".normalize("NFD").split("")[1].charCodeAt()); // 769 code for Combining Acute Accent
console.log("´".charCodeAt());                               // 180 code for Normal Acute Accent

console.log("è".normalize("NFD").split("")[1].charCodeAt()); // 768 Combining Grave Accent
console.log("`".charCodeAt());                               // 96 Normal Grave Accent

console.log("á".normalize("NFD").split("")[1].charCodeAt()); // 769 code for Combining Acute Accent
console.log("´".charCodeAt());                               // 180 code for Normal Acute Accent
    
console.log("è".normalize("NFD").split("")[1].charCodeAt()); // 768 Combining Grave Accent
console.log("`".charCodeAt());                               // 96 Normal Grave Accent

Upvotes: 4

the Hutt
the Hutt

Reputation: 18408

The Canonical Decomposition(NFD) does work:

let accent = '\u0301';
console.log(`Accute accent: ${accent}`);
let out = 'á'.normalize('NFD').split('').includes(accent);
console.log(out);

Extended demo: You don't have to maintain a map for all characters. In the demo map has been used only to get description of diacritics.

const dc = getDiacritics();
let inputString = 'n\u0303  \u00F1áèâäåçč السّلَامُ عَلِيْكُمُ';
inputString = inputString.normalize('NFD');

inpt.innerText = inputString;

out.innerHTML = getWholeChars(inputString).map(c => {
  let chars = c.split('');
  return `${c} -> <span>${chars[1] + '\u25cc'}</span> ${dc[chars[1]]}\n`;
}).join('');

// get characters with diacritics
function getWholeChars(str) {
  // add diacritics to the expression to extend capabilities
  var re = /.[\u064b-\u065F]+|.[\u0300-\u036F]+/g;
  var match, matches = [];
  while (match = re.exec(str))
    matches.push(match[0]);
  return matches;
}

// list taken from 
// https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
// And
// https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
function getDiacritics() {
  return {
    '\u0300': 'Grave Accent',
    '\u0301': 'Acute Accent',
    '\u0302': 'Circumflex Accent',
    '\u0303': 'Tilde',
    '\u0304': 'Macron',
    '\u0305': 'Overline',
    '\u0306': 'Breve',
    '\u0307': 'Dot Above',
    '\u0308': 'Diaeresis',
    '\u0309': 'Hook Above',
    '\u030A': 'Ring Above',
    '\u030B': 'Double Acute Accent',
    '\u030C': 'Caron',
    '\u030D': 'Vertical Line Above',
    '\u030E': 'Double Vertical Line Above',
    '\u030F': 'Double Grave Accent',
    '\u0310': 'Candrabindu',
    '\u0311': 'Inverted Breve',
    '\u0312': 'Turned Comma Above',
    '\u0313': 'Comma Above',
    '\u0314': 'Reversed Comma Above',
    '\u0315': 'Comma Above Right',
    '\u0316': 'Grave Accent Below',
    '\u0317': 'Acute Accent Below',
    '\u0318': 'Left Tack Below',
    '\u0319': 'Right Tack Below',
    '\u031A': 'Left Angle Above',
    '\u031B': 'Horn',
    '\u031C': 'Left Half Ring Below',
    '\u031D': 'Up Tack Below',
    '\u031E': 'Down Tack Below',
    '\u031F': 'Plus Sign Below',
    '\u0320': 'Minus Sign Below',
    '\u0321': 'Palatalized Hook Below',
    '\u0322': 'Retroflex Hook Below',
    '\u0323': 'Dot Below',
    '\u0324': 'Diaeresis Below',
    '\u0325': 'Ring Below',
    '\u0326': 'Comma Below',
    '\u0327': 'Cedilla',
    '\u0328': 'Ogonek',
    '\u0329': 'Vertical Line Below',
    '\u032A': 'Bridge Below',
    '\u032B': 'Inverted Double Arch Below',
    '\u032C': 'Caron Below',
    '\u032D': 'Circumflex Accent Below',
    '\u032E': 'Breve Below',
    '\u032F': 'Inverted Breve Below',
    '\u0330': 'Tilde Below',
    '\u0331': 'Macron Below',
    '\u0332': 'Low Line',
    '\u0333': 'Double Low Line',
    '\u0334': 'Tilde Overlay',
    '\u0335': 'Short Stroke Overlay',
    '\u0336': 'Long Stroke Overlay',
    '\u0337': 'Short Solidus Overlay',
    '\u0338': 'Long Solidus Overlay',
    '\u0339': 'Right Half Ring Below',
    '\u033A': 'Inverted Bridge Below',
    '\u033B': 'Square Below',
    '\u033C': 'Seagull Below',
    '\u033D': 'X Above',
    '\u033E': 'Vertical Tilde',
    '\u033F': 'Double Overline',
    '\u0340': 'Grave Tone Mark',
    '\u0341': 'Acute Tone Mark',
    '\u0342': 'Greek Perispomeni',
    '\u0343': 'Greek Koronis',
    '\u0344': 'Greek Dialytika Tonos',
    '\u0345': 'Greek Ypogegrammeni',
    '\u0346': 'Bridge Above',
    '\u0347': 'Equals Sign Below',
    '\u0348': 'Double Vertical Line Below',
    '\u0349': 'Left Angle Below',
    '\u034A': 'Not Tilde Above',
    '\u034B': 'Homothetic Above',
    '\u034C': 'Almost Equal To Above',
    '\u034D': 'Left Right Arrow Below',
    '\u034E': 'Upwards Arrow Below',
    '\u034F': 'Grapheme Joiner',
    '\u0350': 'Right Arrowhead Above',
    '\u0351': 'Left Half Ring Above',
    '\u0352': 'Fermata',
    '\u0353': 'X Below',
    '\u0354': 'Left Arrowhead Below',
    '\u0355': 'Right Arrowhead Below',
    '\u0356': 'Right Arrowhead And Up Arrowhead Below',
    '\u0357': 'Right Half Ring Above',
    '\u0358': 'Dot Above Right',
    '\u0359': 'Asterisk Below',
    '\u035A': 'Double Ring Below',
    '\u035B': 'Zigzag Above',
    '\u035C': 'Double Breve Below',
    '\u035D': 'Double Breve',
    '\u035E': 'Double Macron',
    '\u035F': 'Double Macron Below',
    '\u0360': 'Double Tilde',
    '\u0361': 'Double Inverted Breve',
    '\u0362': 'Double Rightwards Arrow Below',
    '\u0363': 'Latin Small Letter A',
    '\u0364': 'Latin Small Letter E',
    '\u0365': 'Latin Small Letter I',
    '\u0366': 'Latin Small Letter O',
    '\u0367': 'Latin Small Letter U',
    '\u0368': 'Latin Small Letter C',
    '\u0369': 'Latin Small Letter D',
    '\u036A': 'Latin Small Letter H',
    '\u036B': 'Latin Small Letter M',
    '\u036C': 'Latin Small Letter R',
    '\u036D': 'Latin Small Letter T',
    '\u036E': 'Latin Small Letter V',
    '\u036F': 'Latin Small Letter X',
    '\u064B': 'Arabic Fathatan',
    '\u064C': 'Arabic Dammatan',
    '\u064D': 'Arabic Kasratan',
    '\u064E': 'Arabic Fatha',
    '\u064F': 'Arabic Damma',
    '\u0650': 'Arabic Kasra',
    '\u0651': 'Arabic Shadda',
    '\u0652': 'Arabic Sukun',
    '\u0653': 'Arabic Maddah Above',
    '\u0654': 'Arabic Hamza Above',
    '\u0655': 'Arabic Hamza Below',
    '\u0656': 'Arabic Subscript Alef',
    '\u0657': 'Arabic Inverted Damma',
    '\u0658': 'Arabic Mark Noon Ghunna',
    '\u0659': 'Arabic Zwarakay',
    '\u065A': 'Arabic Vowel Sign Small V Above',
    '\u065B': 'Arabic Vowel Sign Inverted Small V Above',
    '\u065C': 'Arabic Vowel Sign Dot Below',
    '\u065D': 'Arabic Reversed Damma',
    '\u065E': 'Arabic Fatha With Two Dots',
    '\u065F': 'Arabic Wavy Hamza Below',
  };
}
body {
  padding: 1rem;
}

pre>span {
  font-size: 2rem;
  font-weight: bold;
}

#inpt {
  font-family: 'Courier New', Courier, monospace;
  font-weight: bold;
  font-size: 2rem;
}
Input: <span id="inpt"></span>
<pre id="out"></pre>

For other languages, add diacritics to the regular expression in getWholeChars.

Upvotes: 3

Related Questions