Jeffrin Prabahar

Reputation: 115

How to get the first Tamil letter in a word?

I want to find the first Tamil letter in a string. For instance, in the string "யாத்திராகமம்", the first letter is யா.

When I naively try like this:

const word = "யாத்திராகமம்";
const firstLetter = word.match(/[^\w]/u);
console.log(firstLetter);

... the result is ய, which is not correct. It should be யா.

I then tried using the XRegExp library which can handle Unicode characters:

const tamilRegex = XRegExp("\\p{Tamil}", "ug");
const match = XRegExp.exec(word, tamilRegex);
return match;

But the above code still returns wrong results.

How can I get the proper first Tamil letter in a word, using a regex or any other way?

Upvotes: 6

Views: 1083

Answers (2)

Casimir et Hippolyte

Reputation: 89629

It's also possible to use Unicode character classes and Unicode properties to build the pattern:

s.match(/\p{Script=Tamil}\p{Diacritic}*/u)

Note that the Diacritic property isn't specific to the Tamil script: it also matches combining characters that belong to other scripts.
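A variation on this idea, as a minimal sketch: it swaps `\p{Diacritic}` for the general Mark category `\p{M}`, which is a substitution and not the pattern given above. It assumes that every dependent sign that should attach to the base letter is in a Mark category, and the helper name `firstTamilCluster` is hypothetical.

```javascript
// Hypothetical helper: returns the first Tamil-script character together with
// any combining marks that directly follow it, or null if there is none.
// Assumption: the dependent signs all belong to the Unicode Mark category (\p{M}).
function firstTamilCluster(text) {
  const match = text.match(/\p{Script=Tamil}\p{M}*/u);
  return match ? match[0] : null;
}

console.log(firstTamilCluster("யாத்திராகமம்"));
```

Like the Diacritic property, `\p{M}` is not limited to Tamil, so a mark from another script placed right after a Tamil letter would also be included in the match.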

Upvotes: 0

trincot

Reputation: 350941

I don't know the Tamil script, but Wikipedia explains the concept of compound letters in that script. The Tamil Unicode block covers the range U+0B80 to U+0BFF. Within it, the subrange U+0BBE to U+0BCD and the single code point U+0BD7 are suffixes that need to be combined with the preceding consonant to make it a compound letter.

Without any specialised library or smarter regex support, it seems you can make it work with the regex [\u0b80-\u0bff][\u0bbe-\u0bcd\u0bd7]?, which matches one character in the Tamil range, optionally followed by one of those suffix code points.

let s = "this is Tamil: யாத்திராகமம்";

console.log("First Tamil character: ", s.match(/[\u0b80-\u0bff][\u0bbe-\u0bcd\u0bd7]?/u));
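Since `match` returns an array (or `null` when nothing matches), it may be handier to wrap the regex in a small function that returns just the matched string. This is only a sketch; the names `TAMIL_LETTER` and `firstTamilLetter` are hypothetical.

```javascript
// Hypothetical wrapper around the regex from this answer:
// one code point in the Tamil block, optionally followed by one suffix code point.
const TAMIL_LETTER = /[\u0b80-\u0bff][\u0bbe-\u0bcd\u0bd7]?/u;

function firstTamilLetter(text) {
  const match = text.match(TAMIL_LETTER);
  // match is null when the text contains no character from the Tamil block
  return match === null ? null : match[0];
}

console.log(firstTamilLetter("this is Tamil: யாத்திராகமம்"));
console.log(firstTamilLetter("no Tamil here")); // null
```

The first call should log the two-code-point letter from the question (the consonant plus its vowel sign) instead of the full match array.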

Upvotes: 8
