Sachin Kumar

Reputation: 69

Count the number of characters in Indic languages (Hindi, Tamil; should support all Indian languages)

Is there an efficient way to count characters in Indic languages such as Hindi and Tamil? For example, the English word "Mother" has 6 letters. The same word in Hindi (माता) has two letters (मा + ता), but its character length is reported as 4. Is there any way to count the number of real characters?

माता   -> actual -> 4, Expected -> 2
जगदीश -> actual -> 5, Expected -> 4
क्रमश  -> actual -> 5, Expected -> 3

Upvotes: 4

Views: 723

Answers (1)

Abhinav reddy Boddu

Reputation: 1

I have the same requirement. From what I have searched, there isn't any plug-and-play package that does this. The problem with Indic languages is that the word माता is stored as "ma" + "aa" (matra) + "ta" + "aa" (matra), so its length becomes 4. To avoid this, you will have to hardcode the ranges of Unicode characters that correspond to full letters only, and ignore the remaining characters.
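To see that breakdown for yourself, here is a small sketch, assuming Python and its standard `unicodedata` module. The helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

# माता written with escapes so the code points are unambiguous.
word = "\u092E\u093E\u0924\u093E"

for code_point, name, category in describe(word):
    print(code_point, name, category)

print(len(word))  # len() counts code points, not visible letters
```

The letters come out with category `Lo`, while the matras come out with a mark category (`Mc` here), which is why the length is larger than the number of letters you see.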

Look into the Devanagari Unicode code block.

In that table, U+0904 to U+0939 and U+0958 to U+095F are the normal letters; the rest are mostly matras and other signs, which you should ignore. So, in whatever programming language you use, apply a .filter() or similar operation that keeps only the normal letters, and count what remains.
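A minimal sketch of that filter, assuming Python and Devanagari only. The ranges are the ones given above; the function names and the virama handling in `count_aksharas` are my own additions for illustration, not a standard algorithm.

```python
# Ranges of full Devanagari letters, as listed in the answer.
BASE_RANGES = ((0x0904, 0x0939), (0x0958, 0x095F))
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant), joins consonants into a conjunct

def is_base_letter(ch):
    """True if ch falls inside one of the full-letter ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in BASE_RANGES)

def count_letters(text):
    """Count full letters only, ignoring matras and other signs."""
    return sum(1 for ch in text if is_base_letter(ch))

def count_aksharas(text):
    """Like count_letters, but a letter that directly follows a virama
    is treated as part of the previous conjunct and not counted again.
    Assumption: no ZWJ/ZWNJ sits between the virama and the next letter."""
    count = 0
    prev = ""
    for ch in text:
        if is_base_letter(ch) and prev != VIRAMA:
            count += 1
        prev = ch
    return count
```

With the plain range filter, `count_letters` gives 2 for माता and 4 for जगदीश, but 4 for क्रमश, because the two consonants of the conjunct क्र are counted separately. The `count_aksharas` variant gives 3 for क्रमश, which matches the expected value in the question. Other scripts such as Tamil would need their own ranges.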

Upvotes: -1
