Reputation:
Do you think it is possible only with Regex?
Here is my attempt on the Go Playground.
It works, but the code is messy:
http://play.golang.org/p/YysZCB3vlu
I want decomposed Korean characters to be combined into complete letters. For example, "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" should become 좋은값이싸요아침안녕하세요웬
For browsers that don't render Korean characters correctly:
좋 은 값 이 싸 요 아 침 안 녕 하 세 요 웬
The easy part is that a Korean letter always starts with one consonant followed by one or two vowels. That can be caught with (.([ㅏ-ㅣ])+).
The challenging part is the zero, one, or at most two optional consonants that follow the vowel. It is also hard because those optional consonants are followed by yet another consonant that does not belong to the previous letter; that consonant marks the start of a new letter.
Like below:
ㄱㅏㅂㅅㅇㅣ
= ㄱㅏㅂㅅ + ㅇㅣ
= 값 + 이
= 값이
It is possible to catch all the patterns with if-conditions and basic regexes, but I would like a shorter version.
My ultimate goal is to convert "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" to 좋은값이싸요아침안녕하세요웬
For browsers that don't render Korean characters correctly:
좋 은 값 이 싸 요 아 침 안 녕 하 세 요 웬
Upvotes: 4
Views: 621
Reputation: 532
I don't know Korean, but it sounds like your possible input combinations are:
CV (C = consonant, V = vowel)
CVV
CVVC
CVVCC
CVC
CVCC
So a regex rule to capture that (without capturing the first consonant of the next letter) is:
CV{1,2}C{0,2}(?!V)
Then you just need to define your C and V character classes, for example by replacing V with [ㅏ-ㅣ].
Use your program to loop through the matches found in the string, and output the combined letter for each match.
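To make the "output the combined word" step concrete, here is a minimal sketch of how one matched group could be turned into a single precomposed syllable. It assumes the input uses Hangul Compatibility Jamo (the ㄱ-ㅎ and ㅏ-ㅣ characters shown in the question) and relies on the standard Unicode arithmetic for Hangul syllables: 0xAC00 + (lead*21 + vowel)*28 + tail. The names Compose, index, and compounds are hypothetical helpers made up for this illustration; they are not taken from the asker's playground code.

```go
package main

import "fmt"

// Jamo tables, in the order used by the Unicode Hangul syllable block.
var (
	leads  = []rune("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
	vowels = []rune("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
	tails  = []rune("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
)

// compounds maps a typed pair of jamo to the single compound jamo
// (hypothetical helper table for this sketch).
var compounds = map[string]rune{
	"ㅗㅏ": 'ㅘ', "ㅗㅐ": 'ㅙ', "ㅗㅣ": 'ㅚ', "ㅜㅓ": 'ㅝ', "ㅜㅔ": 'ㅞ', "ㅜㅣ": 'ㅟ', "ㅡㅣ": 'ㅢ',
	"ㄱㅅ": 'ㄳ', "ㄴㅈ": 'ㄵ', "ㄴㅎ": 'ㄶ', "ㄹㄱ": 'ㄺ', "ㄹㅁ": 'ㄻ', "ㄹㅂ": 'ㄼ',
	"ㄹㅅ": 'ㄽ', "ㄹㅌ": 'ㄾ', "ㄹㅍ": 'ㄿ', "ㄹㅎ": 'ㅀ', "ㅂㅅ": 'ㅄ',
}

// index returns the position in table of the jamo spelled by s
// (one jamo, or two jamo that form a known compound), or -1.
func index(table []rune, s string) int {
	r := []rune(s)
	var want rune
	switch len(r) {
	case 1:
		want = r[0]
	case 2:
		c, ok := compounds[s]
		if !ok {
			return -1
		}
		want = c
	default:
		return -1
	}
	for i, t := range table {
		if t == want {
			return i
		}
	}
	return -1
}

// Compose builds one precomposed syllable from a leading consonant,
// one or two vowels, and zero to two trailing consonants.
func Compose(lead, vowel, tail string) (rune, bool) {
	l, v := index(leads, lead), index(vowels, vowel)
	if l < 0 || v < 0 {
		return 0, false
	}
	t := 0 // 0 means "no final consonant"
	if tail != "" {
		i := index(tails, tail)
		if i < 0 {
			return 0, false
		}
		t = i + 1
	}
	return rune(0xAC00 + (l*21+v)*28 + t), true
}

func main() {
	r, _ := Compose("ㄱ", "ㅏ", "ㅂㅅ")
	fmt.Println(string(r)) // prints 값
}
```

The sketch returns false for combinations that do not form a valid syllable, so the calling loop can decide how to handle stray jamo.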
EDIT: Go doesn't support negative lookahead, so I suggest doing the following:
C{0,2}V{1,2}C
There are other ways of getting around the lack of negative lookahead, but they will probably involve a lot more code to manipulate where the next match starts in the input string.
Also, when defining the set of characters you will look for as vowels or consonants, it would be better to use Unicode escape sequences rather than the Korean glyphs themselves (e.g., \x{314F} for ㅏ). Go's regexp syntax accepts the \x{...} form for code points.
Upvotes: 2