shantanuo

Reputation: 32286

Extract unicode characters within a certain range from a string

I have a text file with a lot of junk characters.

https://raw.githubusercontent.com/shantanuo/marathi_spell_check/master/dicts/sample.txt

I need to keep only the Devanagari characters. The expected clean output looks something like this:

भूमी
भूमी
भूमीला
भैय्यासाहेब
भैरवनाथ
भैरवी
भैरव
गावापासून
गा

As per this page, I need to extract all characters in the Unicode range U+0900 to U+097F: https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)


I tried this code, but it returns some foreign characters.

def remove_junk(word):
    mylist=list()
    for i in word:
        if b'9' in (i.encode('ascii', 'backslashreplace')):
            mylist.append(i)
    return (''.join(mylist))

with open('sample2a.txt', 'w') as nf:
    with open('sample.txt') as f:
        for i in f:
            nf.write(remove_junk(i) + '\n')

Upvotes: 0

Views: 915

Answers (2)

user11684454

Reputation: 31

I don't know Python, but I guess it is possible to use Unicode properties in regular expressions just as in JavaScript, so it may be possible to adapt the following script, which uses the Devanagari script property:

var text =
`‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
(ページを閲覧しているビジターの使用言語)。
(缺少文字)
गावापासून
�गा`;
console.log (text.replace (/[^\r\n\p{Script=Devanagari}]/gu, ""));

which yields:

भूमी
भूमी
भूमीला
भैय्यासाहेब
भैरवनाथ
भैरवी
भैरव



गावापासून
गा
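
For a Python adaptation: the built-in `re` module does not support `\p{...}` property classes (the third-party `regex` module does). A rough standard-library approximation, assuming it is acceptable to treat "Unicode character name starts with DEVANAGARI" as a stand-in for the script property, could look like the sketch below. The function name `keep_devanagari` is made up for illustration, and the name test is not identical to `Script=Devanagari` (for example, it also keeps the danda characters, whose script is Common).

```python
import unicodedata

def keep_devanagari(text):
    """Keep line breaks and characters whose Unicode name starts with DEVANAGARI.

    Assumption: the character name is used as an approximation of the
    Devanagari script property; it is not an exact equivalent.
    """
    return ''.join(
        ch for ch in text
        # unicodedata.name() raises for unnamed characters unless a default is given
        if ch in '\r\n' or unicodedata.name(ch, '').startswith('DEVANAGARI')
    )

if __name__ == '__main__':
    print(keep_devanagari('‘भूमी’ला\n(缺少文字)\nगावापासून'))
```

Unlike a fixed code point range, this also keeps characters from the Devanagari Extended block, since their names start with DEVANAGARI as well.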

Upvotes: 3

cs95

Reputation: 402852

You can remove every character outside the Unicode range U+0900-U+097F (the Devanagari block) with a regex.

import re

p = re.compile(r'[^\u0900-\u097F\n]')   # preserve the trailing newline
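# explicit encoding, so the result does not depend on the platform's default
with open('sample.txt', encoding='utf-8') as f, open('sample2a.txt', 'w', encoding='utf-8') as nf: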
    for line in f:
        cleaned = p.sub('', line)
        if cleaned.strip():
            nf.write(cleaned)

Minimal Code Sample

import re

text = '''
‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
(ページを閲覧しているビジターの使用言語)。
(缺少文字)
गावापासून
गा
'''

p = re.compile(r'[^\u0900-\u097F\n]')
for line in text.splitlines():
    cleaned = p.sub('', line)
    if cleaned.strip():
        print(cleaned)

# भूमी
# भूमी
# भूमीला
# भैय्यासाहेब
# भैरवनाथ
# भैरवी
# भैरव
# गावापासून
# गा
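
As to why the code in the question lets foreign characters through: its test looks for the digit 9 anywhere in the character's backslash escape, not for a code point inside the block. The sketch below contrasts the two checks; the helper names `in_devanagari_block` and `escape_contains_nine` are hypothetical, and U+8A9E is just one CJK character from the sample that happens to have a 9 in its escape.

```python
def in_devanagari_block(ch):
    """True if ch lies in the Devanagari block, U+0900 to U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F

def escape_contains_nine(ch):
    """The test used in the question: is there a '9' anywhere in the escape?"""
    return b'9' in ch.encode('ascii', 'backslashreplace')

if __name__ == '__main__':
    for ch in ('\u092d', '\u8a9e', '9'):
        # Show the escape next to the result of each check
        print(ch.encode('ascii', 'backslashreplace'),
              escape_contains_nine(ch),
              in_devanagari_block(ch))
```

The escape-based test accepts U+8A9E and the ASCII digit 9 itself, while the numeric range comparison accepts only the Devanagari letter.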

Upvotes: 4
