Reputation: 59
I'm having trouble stemming words in my local language with a rule-based algorithm, and I'd appreciate help from anyone who knows Python.
In my language, some words are pluralized by repeating the first 2 or 3 characters (sounds).
For example:
Diimaa (root word) ==> Diddiimaa (plural word)
Adii (root word) ==> Adadii (plural word)
So I want my program to remove "Did" from the first example and "Ad" from the second.
The following is my code; it does not return any result:
`def compput(mm):
    vv=1
    for i in mm:
        if seevowel(i)==1:
            inxt=mm.index(i)+1
            if inxt<len(mm)-1 and seevowel(mm[inxt])==0:
                vv=vv+1
    return vv

def stemm_maker(tkn):
    for i in range(len(tkn)):
        if (i[0] == i[2] and i[1] == i[3]):
            stem = i[2:]
            if compput(stem) > 0:
                return stem
        elif ((i[0] == i[2] or i[0]== i[3]) and i[1] == i[4]):
            stem = i[3:]
            if compput(self) > 0:
                return stem
        else:
            return tkn

print(stem)`
Upvotes: 3
Views: 310
Reputation: 59
This is an answer to my own question above. I tried the following rule-based code, and it works correctly for the words assigned to jechoota:
jechoota = "diddiimaa adadii babaxxee babbareedaa gaggaarii guguddaa hahhamaa hahapphii"
token = jechoota.split()

def stem(word):
    # Fall back to the word itself when neither rule applies.
    stemed = word
    if word[0] == word[2] and word[1] == word[3]:
        # XYXY...: drop the first two characters
        stemed = word[2:]
    elif word[0] == word[2] and word[0] == word[3] and word[1] == word[4]:
        # XYXXY...: drop the first three characters
        stemed = word[3:]
    return stemed

for i in token:
    print(stem(i))
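Note that stem() indexes word[3] and word[4] directly, so a word shorter than five characters can raise an IndexError. A minimal sketch of a guarded variant follows; the name stem_guarded is hypothetical, and the two rules are the same ones used above, with length checks added and the word returned unchanged when no rule applies.

```python
def stem_guarded(word):
    """Hypothetical variant of stem(): same two rules, plus length checks."""
    # XYXY...: the first two characters are repeated once
    if len(word) >= 4 and word[0] == word[2] and word[1] == word[3]:
        return word[2:]
    # XYXXY...: the first character is doubled before the root starts
    if len(word) >= 5 and word[0] == word[2] == word[3] and word[1] == word[4]:
        return word[3:]
    # No reduplication detected: leave the word as it is
    return word


jechoota = "diddiimaa adadii babaxxee babbareedaa gaggaarii guguddaa hahhamaa hahapphii"
stems = [stem_guarded(w) for w in jechoota.split()]
print(stems)
```

Because the rules look only at the spelling, any word that happens to start with XYXY or XYXXY will be shortened, whether or not it is actually a plural.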
Upvotes: 2
Reputation: 215029
One way to attack this problem is with regular expressions.
Looking at these pairs (found here):
adadii adii
babaxxee baxxee
babbareedaa bareedaa
diddiimaa diimaa
gaggaarii gaarii
guguddaa guddaa
hahhamaa hamaa
hahapphii happhii
the rule appears to be
if the word starts with XY...
then the reduplicated word is either XYXY... or XYXXY...
In the regex language this can be expressed as
^(.)(.)\1?(?=\1\2)
which means:
char 1
char 2
maybe char 1
followed by
char 1
char 2
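To see what each part of the pattern captures, here is a small sketch that returns the matched prefix for a few words. The helper name reduplicated_prefix is made up for illustration; the pattern is the one given above.

```python
import re

PATTERN = re.compile(r'^(.)(.)\1?(?=\1\2)')


def reduplicated_prefix(word):
    """Return the reduplicated prefix the pattern matches, or None."""
    m = PATTERN.match(word)
    # group(0) is only the prefix: the lookahead (?=...) consumes nothing
    return m.group(0) if m else None


for word in ("diddiimaa", "adadii", "diimaa"):
    print(word, "->", reduplicated_prefix(word))
```

The optional \1? is tried first, so for "diddiimaa" the prefix is "did"; for "adadii" the engine backtracks to the shorter prefix "ad" because the lookahead fails after "ada".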
Complete example:
import re

test = {
    'adadii': 'adii',
    'babaxxee': 'baxxee',
    'babbareedaa': 'bareedaa',
    'diddiimaa': 'diimaa',
    'gaggaarii': 'gaarii',
    'guguddaa': 'guddaa',
    'hahhamaa': 'hamaa',
    'hahapphii': 'happhii',
}

def singularize(word):
    m = re.match(r'^(.)(.)\1?(?=\1\2)', word)
    if m:
        return word[len(m.group(0)):]
    return word

for p, s in test.items():
    assert singularize(p) == s
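The examples in the question are capitalized ("Diddiimaa"), which the lowercase test set does not cover. As a possible extension, and only a sketch, the same pattern can be applied to running text with re.sub, anchored at word boundaries and made case-insensitive. The names REDUP and singularize_text are assumptions, not part of the code above.

```python
import re

# Same pattern, but \b replaces ^ so it applies at the start of every word,
# and IGNORECASE lets "D" and "d" count as the same character.
REDUP = re.compile(r'\b(\w)(\w)\1?(?=\1\2)', re.IGNORECASE)


def singularize_text(text):
    """Strip the reduplicated prefix from every word in `text`."""
    return REDUP.sub('', text)


print(singularize_text("Diddiimaa adadii"))
```

Stripping the prefix also removes the original capital letter, so "Diddiimaa" comes back as "diimaa"; restoring the case would need an extra step.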
Upvotes: 2