Nate Glenn

Reputation: 6744

Perl regex find character from arbitrary set

I have a file with Korean and Chinese characters. I want to find pairs where a parenthetical gives the hanja for a Korean word, like this: 한문 (漢文)

The search would look something like this: /[korean characters] \([chinese characters]\)/

How do I specify the Chinese or Korean characters, or any other set such as Cyrillic or Thai for example?

Upvotes: 6

Views: 1294

Answers (1)

ikegami

Reputation: 386541

Unicode provides properties that identify to which script characters belong. Characters can be matched based on their script property using \p{Script=...}.

I don't know much about the languages you mentioned, but I think you want:

  • \p{Script=Han} aka \p{Han} for Chinese.
  • \p{Script=Hangul} aka \p{Hangul} for Korean.
  • \p{Script=Cyrillic} aka \p{Cyrl} for Cyrillic.
  • \p{Script=Thai} aka \p{Thai} for Thai.
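
Applied to the pattern from the question, a minimal sketch might look like the following. The sample string and variable names are made up for illustration, and the pattern assumes a single space before the opening parenthesis, as in the question's example.

```perl
use strict;
use warnings;
use utf8;                               # this source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');    # print decoded text as UTF-8

# Hypothetical sample input: one hanja gloss and one Latin-script gloss.
my $text = "한문 (漢文) is glossed with hanja; 러시아 (Russia) is not.";

# Hangul characters, a space, then Han characters inside parentheses.
my @pairs;
while ($text =~ /(\p{Script=Hangul}+) \((\p{Script=Han}+)\)/g) {
    push @pairs, [$1, $2];
}

printf "%s => %s\n", @$_ for @pairs;
```

The same shape should work for other scripts by swapping in, say, \p{Script=Cyrillic} or \p{Script=Thai}. Text read from a file needs to be decoded first (for example by opening it with an :encoding(UTF-8) layer), otherwise the properties are tested against raw bytes.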

You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.

$ uniprops D55C
U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo
    Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables
    ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word
    XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
    X_POSIX_Graph X_POSIX_Print X_POSIX_Word
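
If Unicode::Tussle isn't installed, the script of a character can also be looked up from Perl itself. A small sketch, assuming the core Unicode::UCD module and its charscript function are enough for your purposes (the code points chosen here are just examples):

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Example code points for 한, 漢, Ж and ก respectively.
my @code_points = (0xD55C, 0x6F22, 0x0416, 0x0E01);

# charscript() takes a code point and returns the name of its script.
my %script_of = map { $_ => charscript($_) } @code_points;

printf "U+%04X %s\n", $_, $script_of{$_} for @code_points;
```

The name that comes back is the one to put after Script= in \p{Script=...}.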

To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)

$ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
...

$ unichars -au '\p{Hangul}'
 ᄀ U+01100 HANGUL CHOSEONG KIYEOK
 ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
 ᄂ U+01102 HANGUL CHOSEONG NIEUN
 ᄃ U+01103 HANGUL CHOSEONG TIKEUT
 ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
...

* — uniprops and unichars are available from the Unicode::Tussle distro.

Upvotes: 9
