Issue printing individual letters of a string

Question

I am trying to print the individual letters of a string. It works fine with English and Chinese pinyin, but with other scripts the combining Unicode characters (diacritics) are printed as separate letters as well.

Consider this word

महाभूकम्पले

When I separate it by hand, using the keyboard arrow keys and the space bar, this is the result for महाभूकम्पले, just as it would be for the English word 'EXAMPLE':

E X A M P L E

 म हा भू क म्प ले

Now when I try to automate this with a Python script, using this code:

data= 'महाभूकम्पले'
index = 0
while index < len(data):
    letter = data[index]
    print (letter)
    index = index + 1

my result is this (it has separated all the diacritics as well):

म
ह
ा
भ
ू
क
म
्
प
ल
े

What I need is output like this:

म 
हा 
भू 
क 
म्प 
ले

saaj · Accepted Answer

A quick solution (hopefully) that avoids digging into code point semantics (for that, see Martin's answer). It is based on the output of:

import unicodedata

s = 'महाभूकम्पले'
for c in s:
    print(c, unicodedata.category(c))

Which is:

म Lo
ह Lo
ा Mc
भ Lo
ू Mn
क Lo
म Lo
् Mn
प Lo
ल Lo
े Mn

We can join code points in the mark categories Mc (spacing combining mark) and Mn (nonspacing mark) with the preceding code point:

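import unicodedata
from functools import reduce

def reducer(r, v):
    # Attach a combining mark (Mc or Mn) to the previous cluster, if there is one.
    if r and unicodedata.category(v) in ('Mc', 'Mn'):
        r[-1] = r[-1] + v
    else:
        # Anything else starts a new cluster.
        r.append(v)
    return r

print(reduce(reducer, 'महाभूकम्पले', []))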

The output matches the number of combined characters I get in gedit:

['म', 'हा', 'भू', 'क', 'म्', 'प', 'ले']
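
Note that this keeps 'म्' and 'प' apart, while the question asks for 'म्प' as one unit. Below is a minimal sketch of one way to close that gap, assuming it is acceptable to glue a letter onto the previous cluster whenever that cluster ends in the Devanagari virama (U+094D). The names `VIRAMA` and `split_clusters` are made up for this example, and the rule is a simplification for this particular word, not a full implementation of Unicode grapheme segmentation.

```python
import unicodedata

VIRAMA = '\u094d'  # DEVANAGARI SIGN VIRAMA (assumed to be the only conjunct joiner)

def split_clusters(text):
    """Group code points into display clusters (simplified, Devanagari only)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ('Mc', 'Mn')
        # A letter right after a virama continues the conjunct, e.g. म + ् + प.
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category.startswith('L')
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(split_clusters('महाभूकम्पले'))
# ['म', 'हा', 'भू', 'क', 'म्प', 'ले']
```

For other scripts, or for text containing zero-width joiners, a proper grapheme cluster segmenter is probably the safer choice.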
