choman

Reputation: 787

Issue printing individual letters of a string

I am trying to print the individual letters of a string. It works fine with English and Chinese pinyin, but with other scripts the Unicode combining characters (diacritics) are printed separately as well.

Consider this word

महाभूकम्पले

When I separate it by hand, using the keyboard arrow keys and the space bar, I get the result below for महाभूकम्पले, just as I would for an English word such as 'EXAMPLE':

E X A M P L E

 म हा भू क म्प ले

Now when I run a Python script to automate this, using this code:

data= 'महाभूकम्पले'
index = 0
while index < len(data):
    letter = data[index]
    print (letter)
    index = index + 1

my result is this (all the diacritics have been separated as well):

म
ह
ा
भ
ू
क
म
्
प
ल
े

What I need is output like this:

म 
हा 
भू 
क 
म्प 
ले

Upvotes: 1

Views: 90

Answers (2)

saaj

Reputation: 25283

A quick solution (hopefully) that avoids digging into codepoint semantics (for those, see Martijn's answer). It is based on the output of:

import unicodedata

s = 'महाभूकम्पले'
for c in s:
    print(c, unicodedata.category(c))

Which is:

म Lo
ह Lo
ा Mc
भ Lo
ू Mn
क Lo
म Lo
् Mn
प Lo
ल Lo
े Mn

We can join codepoints in these categories (Mc, Mn) with the preceding codepoint:

import unicodedata
from functools import reduce

def reducer(r, v):
    if unicodedata.category(v) in ('Mc', 'Mn'):
        r[-1] = r[-1] + v
    else:
        r.append(v)
    return r

print(reduce(reducer, 'महाभूकम्पले', []))

The output matches the number of combined characters I see in gedit:

['म', 'हा', 'भू', 'क', 'म्', 'प', 'ले']
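Note that this still splits म् from प, whereas the question asks for म्प as one unit. A minimal sketch of one way to close that gap follows, assuming that a letter (category Lo) directly after the virama ् (U+094D) belongs to the same conjunct. The name split_syllables is hypothetical, and this is a heuristic for this kind of input, not a full Unicode segmentation algorithm.

```python
import unicodedata

VIRAMA = '\u094d'  # DEVANAGARI SIGN VIRAMA, the "्" in the example word


def split_syllables(text):
    """Group marks, and letters that follow a virama, with the preceding cluster."""
    clusters = []
    for char in text:
        category = unicodedata.category(char)
        is_mark = category in ('Mc', 'Mn')
        # Assumption: a letter right after a virama continues the conjunct.
        joins_conjunct = (bool(clusters)
                          and clusters[-1].endswith(VIRAMA)
                          and category == 'Lo')
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


print(split_syllables('महाभूकम्पले'))
# ['म', 'हा', 'भू', 'क', 'म्प', 'ले']
```

For the example word this gives the six groups the question asks for. Other scripts, or text using joiner characters, may need different rules.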

Upvotes: 2

Martijn Pieters

Reputation: 1124748

Your data does actually contain 11 characters:

>>> data = 'महाभूकम्पले'
>>> len(data)
11

That's because there are several diacritical characters in there, which, when printed, combine with the preceding character. You'd have to detect these and print them together.

This is easier said than done.

The Unicode database has various ways of spelling characters that can be combined. In Western alphabets, you have diacritics like the cedilla (the curl on the ç), accents, or tremas (á or ä). Unicode can express these as either one or two codepoints; the two forms are called the canonical composed normal form (NFC) and the canonical decomposed normal form (NFD), and you can use the unicodedata.normalize() function to convert between them.
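As a small illustration of the two normal forms, this sketch converts ç in both directions with unicodedata.normalize(). The variable names are only for illustration.

```python
import unicodedata

# "c" followed by COMBINING CEDILLA (U+0327) composes to a single codepoint
composed = unicodedata.normalize('NFC', 'c\u0327')
# LATIN SMALL LETTER C WITH CEDILLA (U+00E7) decomposes to two codepoints
decomposed = unicodedata.normalize('NFD', '\u00e7')

print(len(composed), len(decomposed))  # 1 2
print(composed == '\u00e7')            # True
```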

But for Devanagari combinations such as the ones in your string, there is no composed form; the diacritics are always specified separately. Instead, for these characters the line break behaviour is recorded in the lb table, which describes how they should be handled when a line break needs to be inserted. For Devanagari diacritics, the behaviour is set to CM, or Combining Mark. The exact meaning is described in the Unicode Line Breaking Algorithm. CM is described as:

Class: CM
Descriptive Name: Combining Mark
Examples: Combining marks, control codes
Behaviour: Prohibit a line break between the character and the preceding character

The problem is that the lb data table is not available from the unicodedata module.

You'd have to build your own table, using the LineBreak.txt file as a source, then test whether the next character is listed there as CM and, if so, print it on the same line.

To just extract the CM codepoints:

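cm_chars = set()
with open('LineBreak.txt', encoding='utf-8') as lbtable:
    for line in lbtable:
        # Drop the trailing comment, then split "range;class" into its two fields
        fields = line.split('#', 1)[0].split(';')
        if len(fields) != 2 or fields[1].strip() != 'CM':
            continue
        # A field is either a single codepoint or a "first..last" range
        chars = fields[0].strip().split('..')
        for codepoint in range(int(chars[0], 16), int(chars[-1], 16) + 1):
            cm_chars.add(chr(codepoint))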

and then use this to detect if a next character is to be printed on the same line:

>>> data = 'महाभूकम्पले'
>>> index = 0
>>> while index < len(data):
...     letters = data[index]
...     while index + 1 < len(data) and data[index + 1] in cm_chars:
...         letters += data[index + 1]
...         index += 1
...     print(letters)
...     index += 1
...
म
हा
भू
क
म्
प
ले

This only covers CM characters, however. You probably also want to cover GL (Glue) characters, which attach to both the preceding and the next character in a sequence. For a more complete solution, you'd need to build a no_linebreak(current, next) function that takes the whole lb table into account to determine whether a line break can exist between two characters.
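A rough sketch of what such a function might look like, covering only the CM and GL classes mentioned here and ignoring every other rule of the line breaking algorithm. The helper names load_line_break_classes and split_clusters, and the extra classes argument to no_linebreak, are assumptions made for illustration. The loader takes any iterable of lines in the LineBreak.txt format, so an open file object can be passed in directly.

```python
def load_line_break_classes(lines):
    """Map each listed character to its line-break class, e.g. 'CM' or 'GL'."""
    classes = {}
    for line in lines:
        # Drop the trailing comment, then split "range;class" into its fields
        fields = line.split('#', 1)[0].split(';')
        if len(fields) != 2:
            continue  # blank or comment-only line
        first, _, last = fields[0].strip().partition('..')
        for codepoint in range(int(first, 16), int(last or first, 16) + 1):
            classes[chr(codepoint)] = fields[1].strip()
    return classes


def no_linebreak(current, following, classes):
    """Simplified: a CM stays with what precedes it; GL glues on both sides."""
    return (classes.get(following) == 'CM'
            or 'GL' in (classes.get(current), classes.get(following)))


def split_clusters(text, classes):
    """Split text wherever no_linebreak() allows a break."""
    clusters = []
    for char in text:
        if clusters and no_linebreak(clusters[-1][-1], char, classes):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters
```

With the table loaded via `with open('LineBreak.txt', encoding='utf-8') as f: classes = load_line_break_classes(f)`, calling `split_clusters(data, classes)` should group the characters the same way as the loop above, while also keeping GL characters attached to their neighbours.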

Upvotes: 2
