Reputation: 32286
I have a text file with a lot of junk characters.
https://raw.githubusercontent.com/shantanuo/marathi_spell_check/master/dicts/sample.txt
I need to keep only Devanagari characters. The expected clean output looks something like this:
भूमी
भूमी
भूमीला
भैय्यासाहेब
भैरवनाथ
भैरवी
भैरव
गावापासून
गा
As per this page, I need to extract all characters in the Unicode range U+0900 to U+097F: https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)
I tried this code, but it returns some foreign characters.
def remove_junk(word):
    mylist=list()
    for i in word:
        if b'9' in (i.encode('ascii', 'backslashreplace')):
            mylist.append(i)
    return (''.join(mylist))

with open('sample2a.txt', 'w') as nf:
    with open('sample.txt') as f:
        for i in f:
            nf.write(remove_junk(i) + '\n')
Upvotes: 0
Views: 915
Reputation: 31
I don't know Python, but I guess it is possible to use Unicode properties in regular expressions just like in JavaScript, so it may be possible to adapt the following script, which uses the Devanagari script property:
var text =
`‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
(ページを閲覧しているビジターの使用言語)。
(缺少文字)
गावापासून
�गा`;
console.log (text.replace (/[^\r\n\p{Script=Devanagari}]/gu, ""));
which yields:
भूमी
भूमी
भूमीला
भैय्यासाहेब
भैरवनाथ
भैरवी
भैरव
गावापासून
गा
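For Python readers, here is a rough adaptation that needs only the standard library. The built-in re module does not accept \p{Script=...}; the third-party regex package is the usual way to get Unicode properties, but I have not used it here. The sketch below keeps characters whose Unicode name starts with DEVANAGARI, which only approximates the script property: for example, it also keeps the danda (U+0964), which Unicode assigns to the Common script. The name keep_devanagari is a hypothetical helper, not something from the question.

```python
import unicodedata

def keep_devanagari(text):
    """Keep line breaks and characters whose Unicode name starts with DEVANAGARI.

    Hypothetical helper: a name-based approximation of Script=Devanagari.
    """
    kept = []
    for ch in text:
        # unicodedata.name() returns the default "" for unnamed characters
        if ch in "\r\n" or unicodedata.name(ch, "").startswith("DEVANAGARI"):
            kept.append(ch)
    return "".join(kept)

sample = "‘भूमी’ला\n(缺少文字)\nगावापासून"
print(keep_devanagari(sample))
```

As in the JavaScript version, a line that held only junk is left behind as an empty line, so you may want to skip those when writing the output.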
Upvotes: 3
Reputation: 402852
You can remove all characters outside the Unicode range U+0900-U+097F with a regular expression.
import re

p = re.compile(r'[^\u0900-\u097F\n]') # preserve the trailing newline

with open('sample.txt') as f, open('sample2a.txt', 'w') as nf:
    for line in f:
        cleaned = p.sub('', line)
        if cleaned.strip():
            nf.write(cleaned)
Minimal Code Sample
import re
text = '''
‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
(ページを閲覧しているビジターの使用言語)。
(缺少文字)
गावापासून
गा
'''
p = re.compile(r'[^\u0900-\u097F\n]')
for line in text.splitlines():
    cleaned = p.sub('', line)
    if cleaned.strip():
        print(cleaned)
# भूमी
# भूमी
# भूमीला
# भैय्यासाहेब
# भैरवनाथ
# भैरवी
# भैरव
# गावापासून
# गा
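If you would rather keep the shape of the original remove_junk function, a plain code point comparison should also work. This is only a sketch of that idea, not a replacement for the regex above. It also shows why the check in the question lets foreign characters through: b'9' in i.encode('ascii', 'backslashreplace') is true for any character whose escape contains a 9 anywhere, such as ’ (U+2019), を (U+3092), or the ASCII digit 9 itself.

```python
def remove_junk(word):
    """Keep only code points in the Devanagari block, U+0900 to U+097F."""
    return "".join(ch for ch in word if "\u0900" <= ch <= "\u097f")

# Characters the original check accepts by mistake: each escape contains a 9
for ch in "’を9":
    print(ch, ch.encode("ascii", "backslashreplace"))

print(remove_junk("‘भैरव’"))
```

Unlike the regex version, this drops the newline as well, which matches the question's loop, where '\n' is added back when writing each line.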
Upvotes: 4