Extract unicode characters within a certain range from a string

Question

I have a text file with a lot of junk characters.

https://raw.githubusercontent.com/shantanuo/marathi_spell_check/master/dicts/sample.txt

I need to keep only the Devanagari characters. The expected clean output looks something like this:

भूमी
भूमी
भूमीला
भैय्यासाहेब
भैरवनाथ
भैरवी
भैरव
गावापासून
गा

As per this page, I need to extract all characters in the Unicode range U+0900 to U+097F: https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)

I tried this code, but it returns some foreign characters.

def remove_junk(word):
    mylist=list()
    for i in word:
        if b'9' in (i.encode('ascii', 'backslashreplace')):
            mylist.append(i)
    return (''.join(mylist))

with open('sample2a.txt', 'w') as nf:
    with open('sample.txt') as f:
        for i in f:
            nf.write(remove_junk(i) + '\n')

cs95 · Accepted Answer

You can remove every character outside the Unicode range U+0900-U+097F with a regular expression.

import re

p = re.compile(r'[^\u0900-\u097F\n]')   # preserve the trailing newline
with open('sample.txt', encoding='utf-8') as f, open('sample2a.txt', 'w', encoding='utf-8') as nf:
    for line in f:
        cleaned = p.sub('', line)
        if cleaned.strip():
            nf.write(cleaned)
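
For context, here is a small sketch of why the attempt in the question lets foreign characters through. The b'9' test matches any character whose escaped form contains the digit 9 anywhere, which is a much wider set than the Devanagari block. The helper names passes_old_test and in_devanagari_block are made up for this illustration, and the sample characters were picked only to show the difference.

```python
def passes_old_test(ch):
    # The check from the question: is the digit 9 anywhere in the escaped form?
    return b'9' in ch.encode('ascii', 'backslashreplace')

def in_devanagari_block(ch):
    # Direct code point comparison against U+0900-U+097F.
    return '\u0900' <= ch <= '\u097F'

# Devanagari BHA, right single quote (U+2019), ASCII nine, left single quote (U+2018)
for ch in ['\u092d', '\u2019', '9', '\u2018']:
    print(ascii(ch), passes_old_test(ch), in_devanagari_block(ch))
```

The right single quote U+2019 and the ASCII digit 9 both pass the old test even though neither is in the Devanagari block, while U+2018 is rejected only because its code point happens to contain no 9.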

Minimal Code Sample

import re

text = '''
‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
（ページを閲覧しているビジターの使用言語）。
（缺少文字）
गावापासून
गा
'''

p = re.compile(r'[^\u0900-\u097F\n]')
for line in text.splitlines():
    cleaned = p.sub('', line)
    if cleaned.strip():
        print(cleaned)

# भूमी
# भूमी
# भूमीला
# भैय्यासाहेब
# भैरवनाथ
# भैरवी
# भैरव
# गावापासून
# गा
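
If you would rather avoid regex, the same filter can be sketched with a plain code point comparison. This is an alternative technique, not part of the answer above, and keep_devanagari is a hypothetical helper name. Note that the block U+0900-U+097F also contains the danda (U+0964) and the Devanagari digits (U+0966-U+096F), so both approaches keep those as well.

```python
def keep_devanagari(line):
    # Keep only the characters whose code point lies in U+0900-U+097F.
    return ''.join(ch for ch in line if '\u0900' <= ch <= '\u097F')

for line in ['‘भूमी’ला', 'ﻇﻬﻴﺮ', '（缺少文字）', 'गा']:
    cleaned = keep_devanagari(line)
    if cleaned:  # skip lines that end up empty
        print(cleaned)
```

Unlike the regex version, this drops newlines too, so when writing to a file you would need to add '\n' back after each cleaned line.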
