Reputation:
I am trying to separate English characters from non-English characters, but the numbers are not retained. I want to do this with re.compile. Is there any way to do it?
Code:
import re
a = 'Этап 51 Stage 51'
eng = re.compile(r'[^\u0041-\u024f]')
b=eng.sub(' ',a)
print('eng is >',b)
noneng = re.compile(r'[\u0041-\u024f]')
c=noneng.sub(' ',a)
print('noneng is>',c)
Output:
eng is > Stage
noneng is> Этап 51 51
Expected Output:
eng is > Stage 51
noneng is> Этап 51
Upvotes: 0
Views: 83
Reputation: 42143
maketrans/translate is generally faster than regular expressions for this kind of thing.
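import string

# Table that deletes every printable ASCII character (ASCII only, not all of Unicode)
noASCII = str.maketrans('', '', string.printable)

def onlyENG(a):
    # a.translate(noASCII) leaves only the non-ASCII characters; delete those from a
    return a.translate(str.maketrans('', '', a.translate(noASCII)))

# Table that deletes the ASCII letters
noLetters = str.maketrans('', '', string.ascii_letters)

def nonENG(a):
    return a.translate(noLetters)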
output:
onlyENG('Этап 51 Stage 51') # ' 51 Stage 51'
nonENG('Этап 51 Stage 51') # 'Этап 51  51'
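If you want to check the speed claim on your own data, a rough timing sketch like the following may help. The repeated test string, the function names with_translate and with_regex, and the repeat count are all illustrative choices, and the numbers will vary by machine and Python version.

```python
import re
import string
import timeit

# Illustrative input: the question's string repeated to make timing measurable.
text = 'Этап 51 Stage 51' * 1000

# Translation table that deletes ASCII letters (same idea as nonENG).
no_letters = str.maketrans('', '', string.ascii_letters)
# A regex that deletes the same set of characters.
letters_re = re.compile(r'[A-Za-z]')

def with_translate(s):
    return s.translate(no_letters)

def with_regex(s):
    return letters_re.sub('', s)

# Both approaches should produce the same string.
assert with_translate(text) == with_regex(text)

print('translate:', timeit.timeit(lambda: with_translate(text), number=200))
print('regex:    ', timeit.timeit(lambda: with_regex(text), number=200))
```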
Upvotes: 0
Reputation: 23089
Your first regular expression matches any character whose character code is not between the hex values 41 and 24f. The non-English characters in your input string are outside of this range. The 10 standard numeric digits have character codes between 30 and 39 hex, so they are also outside this range. So the first expression matches the digits and the non-English characters in your input string and replaces them with spaces. What you are left with is just the (non-digit) English characters.
Your second expression does the opposite, matching characters with codes in the range 41 to 24f. It matches exactly what was not matched by the prior expression, just "Stage", and so those characters are replaced with spaces and everything else is retained.
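To see where each character falls relative to that range, a quick check along these lines can help. This is only a sketch; the helper name in_range is made up for illustration.

```python
# Hypothetical helper: reports whether a character's code point lies in
# the U+0041..U+024F range used by the question's character class.
def in_range(ch, lo=0x41, hi=0x24F):
    return lo <= ord(ch) <= hi

# One Cyrillic letter, one digit, one English letter, and a space.
for ch in 'Э5S ':
    print(repr(ch), hex(ord(ch)), in_range(ch))
```

Of these four characters, only 'S' falls inside the range, which is why the digits get the same treatment as the Cyrillic letters.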
Your current expression is not complex enough to do what you want. No matter what range you use, the two expressions will always match the opposite of each other, so if one does not eliminate digits, the other one will. What you want is two expressions, one that matches English characters and one that matches non-English characters, neither of which matches the numeric digit characters.
Here is code that will take out either English or non-English text, but always retain digits. It just fixes your first expression to ignore digits:
import re
a = 'Этап 51 Stage 51'
eng = re.compile(r'(?=[^\u0041-\u024f])[^0-9]')
b=eng.sub(' ',a)
print('eng is >',b)
noneng = re.compile(r'[\u0041-\u024f]')
c=noneng.sub(' ',a)
print('noneng is>',c)
Result:
eng is > 51 Stage 51
noneng is> Этап 51 51
NOTE: I don't understand why you expect the output you show. Why would you ever expect that your code would remove one pair of digit characters but not the other? You must expect that something more complicated is going on here than that each character is being considered and possibly removed independently. But that's all that is going on here.
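If the intent really is that each number stays with the word before it, then per-character substitution cannot express that and the string has to be handled token by token. Here is a minimal sketch under that assumption, which is my guess at the intent and is not stated in the question. The function name split_by_script is hypothetical, and "English" is narrowed to ASCII letters here rather than the 41 to 24f range.

```python
import re

def split_by_script(text):
    """Group whitespace-separated tokens into English / non-English strings.

    Assumption (not stated in the question): a token with no letters,
    such as '51', belongs to the same group as the most recent word.
    """
    eng, noneng = [], []
    current = None  # group of the most recent word; None until one is seen
    for token in text.split():
        if re.search(r'[A-Za-z]', token):
            current = eng
        elif any(ch.isalpha() for ch in token):
            current = noneng
        # Tokens without letters keep the previous group (dropped if none yet).
        if current is not None:
            current.append(token)
    return ' '.join(eng), ' '.join(noneng)

eng, noneng = split_by_script('Этап 51 Stage 51')
print('eng is >', eng)        # eng is > Stage 51
print('noneng is>', noneng)   # noneng is> Этап 51
```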
Upvotes: 1