Go, Regular Expression : very challenging regex on Characters

Question

Do you think this is possible with regex alone?

Here is my attempt on the Go Playground. It works, but only with some dirty code:

http://play.golang.org/p/YysZCB3vlu

I want decomposed Korean characters (jamo) to be combined into complete letters. For example, "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" should become 좋은값이싸요아침안녕하세요웬

For browsers that don't render Korean characters correctly:

좋 은 값 이 싸 요 아 침 안 녕 하 세 요 웬

The easy part is that a Korean letter always starts with one consonant followed by one or two vowels. That can be caught with (.([ㅏ-ㅣ])+).

The challenging part is the zero, one, or at most two optional consonants that follow the vowel. What makes it harder is that after those optional consonants comes another consonant that does not belong to the previous letter; that consonant marks the start of a new letter.

Like below:

ㄱㅏㅂㅅㅇㅣ
= ㄱㅏㅂㅅ  +  ㅇㅣ
= 값 + 이
= 값이
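
To make the gap concrete, here is a minimal Go sketch that runs the pattern from above on this example. The helper name headsOnly is made up for illustration, and the input is assumed to use the standalone (compatibility) jamo shown in this question.

```go
package main

import (
	"fmt"
	"regexp"
)

// The "easy part" pattern from the question: any character followed by one or more vowels.
var head = regexp.MustCompile(`(.([ㅏ-ㅣ])+)`)

// headsOnly (hypothetical helper) returns every consonant+vowel head the pattern finds.
func headsOnly(s string) []string {
	return head.FindAllString(s, -1)
}

func main() {
	// The trailing consonants ㅂㅅ of the first letter are not matched at all.
	fmt.Println(headsOnly("ㄱㅏㅂㅅㅇㅣ")) // prints [ㄱㅏ ㅇㅣ]
}
```

The pattern finds where each letter begins, but the trailing consonants still have to be assigned to a letter somehow.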

It is possible to catch all the patterns with if-conditions and basic regex, but it would be good to have a shorter version.

My ultimate goal is to convert "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" to 좋은값이싸요아침안녕하세요웬

For browsers that don't render Korean characters correctly:

좋 은 값 이 싸 요 아 침 안 녕 하 세 요 웬

GetSet · Accepted Answer

I don't know Korean, but it sounds like your possible input combinations are:

CV (C = consonant, V = vowel)
CVV
CVVC
CVVCC
CVC
CVCC

So a regex rule to capture that (without capturing the first consonant of the next letter) is: CV{1,2}C{0,2}(?!V)

Then you just need to define your C and V character classes, such as replacing V with [ㅏ-ㅣ]
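
As a rough illustration of why the (?!V) part matters, here is a sketch in Go that builds the rule from two character classes but leaves the lookahead out. The class ranges and the name naiveSplit are my own assumptions: they presume the input uses the standalone jamo ㄱ-ㅎ and ㅏ-ㅣ, as in the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// Character classes written with literal jamo (assumed ranges, see above).
const (
	consonant = `[ㄱ-ㅎ]`
	vowel     = `[ㅏ-ㅣ]`
)

// CV{1,2}C{0,2} with the lookahead deliberately omitted.
var naive = regexp.MustCompile(consonant + vowel + `{1,2}` + consonant + `{0,2}`)

// naiveSplit (hypothetical helper) returns the greedy matches.
func naiveSplit(s string) []string {
	return naive.FindAllString(s, -1)
}

func main() {
	// 안녕 should split as ㅇㅏㄴ + ㄴㅕㅇ, but the greedy C{0,2} swallows
	// the second ㄴ, so the next letter loses its leading consonant.
	fmt.Println(naiveSplit("ㅇㅏㄴㄴㅕㅇ")) // prints [ㅇㅏㄴㄴ]
}
```

Without something that stops the trailing consonants from eating the next letter's first consonant, the second letter is lost entirely.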

Use your program to loop through the matches found in the string and output the combined letters.
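
For the combining step, one option is the arithmetic Unicode defines for precomposed Hangul syllables: 0xAC00 + (lead*21 + vowel)*28 + tail. The sketch below is only one way to do it, not necessarily what the asker's playground code does. The function names (compose, merge, indexOf) are made up, and it assumes each match is written in compatibility jamo with double vowels and double final consonants typed as two separate characters.

```go
package main

import "fmt"

// Jamo in the order used by the Unicode syllable formula.
var (
	initials = []rune("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")         // 19 leading consonants
	finals   = []rune("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ") // 27 trailing consonants
)

// Two-jamo sequences that collapse into one compound jamo.
var vowelPairs = map[string]rune{
	"ㅗㅏ": 'ㅘ', "ㅗㅐ": 'ㅙ', "ㅗㅣ": 'ㅚ', "ㅜㅓ": 'ㅝ',
	"ㅜㅔ": 'ㅞ', "ㅜㅣ": 'ㅟ', "ㅡㅣ": 'ㅢ',
}

var finalPairs = map[string]rune{
	"ㄱㅅ": 'ㄳ', "ㄴㅈ": 'ㄵ', "ㄴㅎ": 'ㄶ', "ㄹㄱ": 'ㄺ', "ㄹㅁ": 'ㄻ', "ㄹㅂ": 'ㄼ',
	"ㄹㅅ": 'ㄽ', "ㄹㅌ": 'ㄾ', "ㄹㅍ": 'ㄿ', "ㄹㅎ": 'ㅀ', "ㅂㅅ": 'ㅄ',
}

func isVowel(r rune) bool { return r >= 'ㅏ' && r <= 'ㅣ' }

// indexOf returns the position of r in set, or -1 if it is absent.
func indexOf(set []rune, r rune) int {
	for i, c := range set {
		if c == r {
			return i
		}
	}
	return -1
}

// merge collapses one or two jamo into a single jamo; ok is false otherwise.
func merge(rs []rune, pairs map[string]rune) (rune, bool) {
	switch len(rs) {
	case 1:
		return rs[0], true
	case 2:
		r, ok := pairs[string(rs)]
		return r, ok
	}
	return 0, false
}

// compose turns one matched group such as "ㄱㅏㅂㅅ" into "값".
// Groups that are not a valid syllable are returned unchanged.
func compose(group string) string {
	rs := []rune(group)
	if len(rs) < 2 {
		return group
	}
	end := 1 // index just past the vowel run
	for end < len(rs) && isVowel(rs[end]) {
		end++
	}
	lead := indexOf(initials, rs[0])
	vowel, ok := merge(rs[1:end], vowelPairs)
	if lead < 0 || !ok {
		return group
	}
	tail := 0 // 0 means "no trailing consonant"
	if end < len(rs) {
		t, ok := merge(rs[end:], finalPairs)
		idx := indexOf(finals, t)
		if !ok || idx < 0 {
			return group
		}
		tail = idx + 1
	}
	return string(rune(0xAC00 + (lead*21+int(vowel-'ㅏ'))*28 + tail))
}

func main() {
	fmt.Println(compose("ㄱㅏㅂㅅ") + compose("ㅇㅣ")) // prints 값이
}
```

The vowel index works by plain subtraction because the 21 compatibility vowels ㅏ through ㅣ are laid out in the same order the formula expects; the consonants are not, hence the lookup slices.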

EDIT: Go doesn't support negative lookahead, so I suggest doing the following:

Reverse the string (see, for example, "How to reverse a string in Go?", but be careful with multi-byte Unicode sequences)
Run a match on C{0,2}V{1,2}C
Reverse each match and perform the letter join/lookup
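
The three steps above could be sketched in Go roughly as follows. The names reverseRunes and splitSyllables are invented for this example, and the character classes assume the standalone jamo ranges ㄱ-ㅎ and ㅏ-ㅣ. Reversal is done on runes rather than bytes so the multi-byte characters stay intact.

```go
package main

import (
	"fmt"
	"regexp"
)

// The reversed rule C{0,2}V{1,2}C, with assumed jamo ranges for C and V.
var reversedSyllable = regexp.MustCompile(`[ㄱ-ㅎ]{0,2}[ㅏ-ㅣ]{1,2}[ㄱ-ㅎ]`)

// reverseRunes reverses a string by code point, not by byte.
func reverseRunes(s string) string {
	rs := []rune(s)
	for i, j := 0, len(rs)-1; i < j; i, j = i+1, j-1 {
		rs[i], rs[j] = rs[j], rs[i]
	}
	return string(rs)
}

// splitSyllables returns the jamo group of each letter, in original order.
func splitSyllables(s string) []string {
	matches := reversedSyllable.FindAllString(reverseRunes(s), -1)
	out := make([]string, len(matches))
	for i, m := range matches {
		// Matches come out last letter first, so fill the result back to front.
		out[len(matches)-1-i] = reverseRunes(m)
	}
	return out
}

func main() {
	fmt.Println(splitSyllables("ㄱㅏㅂㅅㅇㅣ")) // prints [ㄱㅏㅂㅅ ㅇㅣ]
}
```

This works because, read backwards, every letter ends with exactly one consonant (its original first consonant), so no lookahead is needed to find the boundary.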

There are other ways of getting around the lack of negative lookahead, but they will probably involve a lot more code to manipulate where the next match starts in the input string.
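
For comparison, one such forward-only workaround might look like the sketch below: match greedily from the front, then give back the last consonant whenever the rest of the input starts with a vowel. This is just one possible shape, with made-up names (splitForward, greedySyllable, leadingVowel) and the same assumed jamo ranges as before.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

var (
	// Anchored CV{1,2}C{0,2}; group 1 holds the trailing consonants.
	greedySyllable = regexp.MustCompile(`^[ㄱ-ㅎ][ㅏ-ㅣ]{1,2}([ㄱ-ㅎ]{0,2})`)
	leadingVowel   = regexp.MustCompile(`^[ㅏ-ㅣ]`)
)

// splitForward walks the input left to right and emulates (?!V) by hand.
func splitForward(s string) []string {
	var groups []string
	for len(s) > 0 {
		loc := greedySyllable.FindStringSubmatchIndex(s)
		if loc == nil {
			break // stop at the first thing that is not a letter
		}
		end := loc[1]
		hasTail := loc[3] > loc[2]
		// If a vowel follows, the last consonant taken belongs to the next letter.
		if hasTail && leadingVowel.MatchString(s[end:]) {
			_, size := utf8.DecodeLastRuneInString(s[:end])
			end -= size
		}
		groups = append(groups, s[:end])
		s = s[end:]
	}
	return groups
}

func main() {
	fmt.Println(splitForward("ㅇㅏㄴㄴㅕㅇㅎㅏ")) // prints [ㅇㅏㄴ ㄴㅕㅇ ㅎㅏ]
}
```

It is noticeably more bookkeeping than the reversed match, which is the trade-off mentioned above.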

Also, when defining the sets of characters you treat as vowels or consonants, it would be better to use Unicode escape sequences rather than the Korean glyphs themselves (e.g. \x{314F} for ㅏ, assuming the input uses the Hangul Compatibility Jamo block). Go's regexp syntax accepts the \x{...} form.
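
Go's regexp syntax documents a \x{...} hex escape, so the classes can be spelled by code point. A small sketch, assuming the jamo in the input come from the Hangul Compatibility Jamo block (consonants U+3131 to U+314E, vowels U+314F to U+3163); the variable names are mine:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// U+3131 (ㄱ) through U+314E (ㅎ): compatibility consonants.
	consonantClass = regexp.MustCompile(`^[\x{3131}-\x{314E}]$`)
	// U+314F (ㅏ) through U+3163 (ㅣ): compatibility vowels.
	vowelClass = regexp.MustCompile(`^[\x{314F}-\x{3163}]$`)
)

func main() {
	fmt.Println(vowelClass.MatchString("ㅏ"), consonantClass.MatchString("ㅏ")) // prints true false
	fmt.Println(vowelClass.MatchString("ㅎ"), consonantClass.MatchString("ㅎ")) // prints false true
}
```

If the input instead used the conjoining jamo block (U+1100 and up), the ranges would have to change accordingly.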
