dab1984

Reputation: 47

Sinhala language model issue for pocketsphinx

I am trying to create a speech recognition system for the Sinhalese language. I tried to create a language model by following the answer in "Build NEW Acoustic model, Dictionary, Language model for uncommon language speech recognition". I used both the online lmtool and cmuclmtk-0.7-win32 on Windows. My input file is as follows:

එක  eka
දෙක de ka
තුන thu na
හතර ha tha ra
පහ  pa ha
හය  ha iya
හත  ha tha
අට  ah ta
නවය na wa ya

After submitting it to lmtool and cmuclmtk, I got the following output:

AHTA    AE T AH
DEKA    D AH K AA
EKA EH K AH
HAIYA   HH EY AY AH
HATHA   HH AE TH AH
HATHARA HH AE TH AH R AH
NAWAYA  N AO EY AH
PAHA    P AE HH AH
THUNA   TH UW N AH
අට  
තුන   
දෙක   
නවය   
පහ  
හත  
හතර   
හය  
එක   

Both the .dic and .lm files contain the characters above. They look like garbage characters to me. What did I do wrong to get this?

Upvotes: 1

Views: 279

Answers (1)

Nikolay Shmyrev

Reputation: 25210

You did everything wrong.

For corpus construction you need a text file, not a dictionary file. You create the dictionary separately.
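To illustrate the split, here is a minimal Python sketch that takes lines in the combined "word pronunciation" layout from the question and separates them into corpus lines (text only) and dictionary lines (word followed by its phones). The function names `split_word_list` and `write_utf8` are hypothetical, and the phone symbols are simply the romanizations from the question, used as placeholders; a real dictionary needs phones that match the phone set of your acoustic model.

```python
def split_word_list(lines):
    """Split 'word phone phone ...' lines into corpus and dictionary entries.

    Hypothetical helper: assumes the first whitespace-separated token is the
    word and every remaining token is one pronunciation symbol.
    """
    corpus, dictionary = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, phones = parts[0], parts[1:]
        corpus.append(word)  # corpus holds text only
        dictionary.append(word + " " + " ".join(phones))  # word plus phones
    return corpus, dictionary


def write_utf8(path, lines):
    """Write lines as UTF-8 so the Sinhala text is stored unchanged."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


# Three entries copied from the question's input file.
sample = ["එක  eka", "දෙක de ka", "තුන thu na"]
corpus, dictionary = split_word_list(sample)
```

The two lists could then be saved with `write_utf8("corpus.txt", corpus)` and `write_utf8("sinhala.dic", dictionary)`; both file names are placeholders. A useful corpus would normally contain whole utterances, one per line, instead of single words.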

You should not use the online lmtool for your language. It works for English only.

To train a language model from text, you should use SRILM.
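As a rough sketch, SRILM's `ngram-count` tool can be driven from Python as shown here. It assumes SRILM is installed and `ngram-count` is on your PATH; the file names `corpus.txt` and `sinhala.lm`, the helper names, and the trigram order are placeholders chosen for illustration, not values from the question.

```python
import shutil
import subprocess


def srilm_command(corpus_path, lm_path, order=3):
    """Build an ngram-count command line (hypothetical helper).

    -text  : plain-text corpus, one utterance per line
    -order : maximum n-gram order
    -lm    : output language model in ARPA format
    """
    return ["ngram-count", "-text", corpus_path,
            "-order", str(order), "-lm", lm_path]


def train(corpus_path, lm_path, order=3):
    """Run ngram-count, failing early if SRILM is not installed."""
    if shutil.which("ngram-count") is None:
        raise RuntimeError("ngram-count not found; is SRILM on your PATH?")
    subprocess.run(srilm_command(corpus_path, lm_path, order), check=True)


# Only builds the command here; call train(...) to actually run SRILM.
cmd = srilm_command("corpus.txt", "sinhala.lm")
```

With a very small corpus such as a nine-word list, the default discounting may not estimate well, so you may need to experiment with SRILM's smoothing options.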

Upvotes: 1
