Reputation: 211
I want to find out whether a word contains both digits and letters and, if so, separate the digit part from the letter part. I want to check Tamil words, e.g. ரூ.100 or ரூ100. I want to separate ரூ. and 100 in the first case, and ரூ and 100 in the second. How do I do this in Python? I tried this:
for word in f.read().strip().split():
    for word1, word2, word3 in zip(word, word[1:], word[2:]):
        if word1 == "ர" and word2 == "ூ " and word3.isdigit():
            print word1
            print word2
        if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2):
            print word1
            print word2
Upvotes: 4
Views: 152
Reputation: 91518
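Use Unicode properties: \pL stands for a letter in any language, and \pN stands for a number in any language (\p{Nd} restricts this to decimal digits). Tamil vowel signs such as ூ are combining marks (\pM), not letters, so include \pM in the first group. In your case it could be:
([\pL\pM]+\.?)(\pN+)
Python's built-in re module does not support \p classes; the third-party regex module does (written as \p{L}, \p{M}, \p{N}).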
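A standard-library-only sketch of the same idea, assuming Python 3 (the question itself uses Python 2). Instead of \p classes it looks up each character's Unicode general category with unicodedata.category and groups consecutive characters into digit and non-digit runs. The names RU, is_decimal_digit and split_digit_runs are made up for this illustration.

```python
import itertools
import unicodedata

# The Tamil syllable from the question: letter RA (U+0BB0) + vowel sign UU (U+0BC2).
RU = "\u0bb0\u0bc2"


def is_decimal_digit(ch):
    # "Nd" is the Unicode general category for decimal digits in any script.
    return unicodedata.category(ch) == "Nd"


def split_digit_runs(word):
    # Group consecutive characters by digit / non-digit and join each run.
    return ["".join(run) for _, run in itertools.groupby(word, key=is_decimal_digit)]


# Shows one letter category (L*) followed by one combining-mark category (M*).
print([unicodedata.category(ch) for ch in RU])

parts_with_dot = split_digit_runs(RU + ".100")
parts_without_dot = split_digit_runs(RU + "100")
```

Here parts_with_dot holds the two pieces ரூ. and 100, and parts_without_dot holds ரூ and 100. Grouping on "digit or not" sidesteps the letter-versus-mark distinction entirely, at the cost of lumping punctuation in with the letters.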
Upvotes: 1
Reputation: 474141
You can use the regular expression (.*?)(\d+)(.*), which captures three groups: everything before the digits, the digits themselves, and everything after them:
>>> import re
>>> pattern = ur'(.*?)(\d+)(.*)'
>>> s = u"ரூ.100"
>>> match = re.match(pattern, s, re.UNICODE)
>>> print match.group(1)
ரூ.
>>> print match.group(2)
100
Or, you can unpack matched groups into variables, like this:
>>> s = u"100ஆம்"
>>> match = re.match(pattern, s, re.UNICODE)
>>> before, digits, after = match.groups()
>>> print before  # prints an empty line: nothing precedes the digits
>>> print digits
100
>>> print after
ஆம்
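For readers on Python 3, a minimal sketch of the same approach wrapped in a function. It assumes Python 3, where str patterns are Unicode-aware by default, so neither the ur'' prefix nor re.UNICODE is needed. The helper name split_around_digits is made up for this illustration, and it returns None for a word with no digits rather than failing on match.groups().

```python
import re

# Same pattern as in the session above, compiled once for reuse.
PATTERN = re.compile(r"(.*?)(\d+)(.*)")


def split_around_digits(word):
    """Return (before, digits, after), or None if word contains no digits."""
    match = PATTERN.match(word)
    return match.groups() if match else None


# "\u0bb0\u0bc2.100" is the question's ரூ.100 written with escapes.
before, digits, after = split_around_digits("\u0bb0\u0bc2.100")
print(digits)  # prints 100
```

Only the first run of digits is split out; a word with several digit runs keeps the later ones in the third group.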
Hope that helps.
Upvotes: 4