Reputation: 6744
I have a file with Korean and Chinese characters. I want to find pairs where a parenthetical gives the hanja for a Korean word, like this: 한문 (漢文)
The search would look something like this: /[korean characters] \([chinese characters]\)/
How do I specify the Chinese or Korean characters, or any other set such as Cyrillic or Thai for example?
Upvotes: 6
Views: 1294
Reputation: 386541
Unicode defines properties that identify the script to which each character belongs. Characters can be matched by their script property using \p{Script=...}.
I don't know much about the languages you mentioned, but I think you want:
\p{Script=Han} aka \p{Han} for Chinese.
\p{Script=Hangul} aka \p{Hangul} for Korean.
\p{Script=Cyrillic} aka \p{Cyrl} for Cyrillic.
\p{Script=Thai} aka \p{Thai} for Thai.
You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.
$ uniprops D55C
U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo
Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables
ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
X_POSIX_Graph X_POSIX_Print X_POSIX_Word
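If uniprops is not installed, the Unicode::UCD module that ships with Perl can answer the narrower question of which script a code point belongs to. The following is only a minimal sketch; the sample code points (한, 漢, Ж, ก) are my own choice for illustration and do not come from the question.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# charscript() takes a code point and returns the name of its script.
# Sample code points (an assumption for illustration): 한, 漢, Ж, ก.
my %script_of;
for my $cp (0xD55C, 0x6F22, 0x0416, 0x0E01) {
    $script_of{$cp} = charscript($cp);
    printf "U+%04X %s\n", $cp, $script_of{$cp};
}
```

The name it reports is the value to put after Script= in the \p{...} construct.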
To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)
$ unichars -au '\p{Han}'
⺀ U+2E80 CJK RADICAL REPEAT
⺁ U+2E81 CJK RADICAL CLIFF
⺂ U+2E82 CJK RADICAL SECOND ONE
⺃ U+2E83 CJK RADICAL SECOND TWO
⺄ U+2E84 CJK RADICAL SECOND THREE
...
$ unichars -au '\p{Hangul}'
ᄀ U+01100 HANGUL CHOSEONG KIYEOK
ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
ᄂ U+01102 HANGUL CHOSEONG NIEUN
ᄃ U+01103 HANGUL CHOSEONG TIKEUT
ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
...
* uniprops and unichars are available from the Unicode::Tussle distro.
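Applied to the pattern in the question, the script properties could be combined roughly like this. It is a sketch under a few assumptions: the sample text is invented, the text is already decoded into Perl characters (when reading from a file you would open it with an :encoding(UTF-8) layer), and exactly one space separates the Korean word from the parenthesis.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Invented sample text; the middle parenthetical holds no hanja.
my $text = "한문 (漢文) is written in 한글 (no hanja here), unlike 조선 (朝鮮).";

# One or more Hangul characters, a space, then Han characters in parentheses.
my @pairs;
while ($text =~ /(\p{Script=Hangul}+) \((\p{Script=Han}+)\)/g) {
    push @pairs, [$1, $2];
}

print "$_->[0] => $_->[1]\n" for @pairs;
```

Swapping in \p{Script=Cyrillic} or \p{Script=Thai} works the same way for other scripts.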
Upvotes: 9