Reputation: 47
I am trying to create a speech recognition system for the Sinhalese language. I tried to create a language model by following the answer in "Build NEW Acoustic model, Dictionary, Language model for uncommon language speech recognition". I used both the online lmtool and cmuclmtk-0.7-win32 on Windows. My input file is as follows:
එක eka
දෙක de ka
තුන thu na
හතර ha tha ra
පහ pa ha
හය ha iya
හත ha tha
අට ah ta
නවය na wa ya
After submitting it to lmtool and cmuclmtk, I got the following output:
AHTA AE T AH
DEKA D AH K AA
EKA EH K AH
HAIYA HH EY AY AH
HATHA HH AE TH AH
HATHARA HH AE TH AH R AH
NAWAYA N AO EY AH
PAHA P AE HH AH
THUNA TH UW N AH
අට
à¶à·”න
දෙක
නවය
පහ
à·„à¶
à·„à¶à¶»
හය
එක
Both the .dic and .lm files contain the characters above. They look like garbage characters to me. What did I do wrong to get this?
Upvotes: 1
Views: 279
Reputation: 25210
You did everything wrong.
For corpus construction you need a plain text file of sentences, not a dictionary file. You create the dictionary separately.
You should not use the online lmtool for your language. It works for English only.
To train a language model from texts you should use SRILM.
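To make the corpus/dictionary split concrete, here is a minimal Python sketch. It rests on assumptions that are not in the question: the file names, the helper names (`write_corpus`, `write_dictionary`, `mojibake`) and the sample sentence are all made up for illustration, and the pronunciations are just the romanized syllables from your input, not a real phone set. The last helper only illustrates one likely source of the strange characters, namely UTF-8 bytes being read in a single-byte encoding; I have not confirmed that this is what happened in your case.

```python
from pathlib import Path

# Hypothetical sample data: words from the question paired with the
# romanized syllables given there (placeholders, not a verified phone set).
ENTRIES = [
    ("එක", "eka"),
    ("දෙක", "de ka"),
    ("තුන", "thu na"),
]


def write_corpus(sentences, path):
    """Write the training text: one sentence per line, words only, UTF-8."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


def write_dictionary(entries, path):
    """Write the dictionary: one 'word pronunciation' line per entry, UTF-8."""
    lines = [f"{word} {pron}" for word, pron in sorted(set(entries))]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def mojibake(text):
    """Return what UTF-8 encoded text looks like when misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")
```

The corpus file holds only running text (for example `write_corpus(["එක දෙක තුන"], "corpus.txt")`), while the word-to-pronunciation pairs go into a separate file. With SRILM installed, a command along the lines of `ngram-count -text corpus.txt -order 3 -lm sinhala.lm` can then train the model from the corpus; the file names here are placeholders.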
Upvotes: 1