Mike Maxwell

Reputation: 617

command line filtering of Unicode block

I've been trying for a couple hours to create a conceptually trivial filter that I can use on the command line, without success. The task is to filter out all lines containing Hangul Jamo characters, while retaining all other lines (which may contain ASCII, characters in the Hangul Syllable block, etc.).

So for example if the input was

 foo
 ᅤᆨ
 간

the output would contain the first and third lines, but not the second, since the second line contains Jamo characters. (The above is not meant to be real Korean, just a simple test case.)

I'm very disappointed with the GNU grep utility (version 2.20). I would have thought the following would work:

grep -Pv '[\x{1100}-\x{11FF}]'

but instead I get the error message "grep: character value in \x{...} sequence is too large". (The \u1100 syntax, which is the actual Perl syntax, simply isn't supported.)

(I do notice that our version 2.20 is rather old. If someone tries the above with a newer version of grep, and it works, I'll certainly consider that an answer--and I'll get our IT folks to upgrade!)

I tried sed, but didn't get any further. (Sorry, I don't remember exactly what sed commands I tried, but sed's support for Unicode blocks doesn't seem any better than grep's.)

Finally, I tried perl (v5.16.3):

 perl -ne 'print unless /[\u1100-\u11ff]/'

This at least succeeds in eliminating the Jamo lines while retaining the Hangul Syllable lines, but it also eliminates the ASCII lines, which I don't want to do. I also would have thought one of the following would work:

perl -ne 'print unless /\p{InHangul_Jamo}/'
perl -ne 'print unless /\p{Block: Hangul_Jamo}/'

but neither appears to have any effect. (Afaik, I shouldn't have to have a .* on each side of the \p{...}, but I tried that too; no luck.)

Locale: in case it matters, I have LANG=en_US.UTF-8.

I'm sure I could do this in Python, but I'd like to understand why neither grep nor perl seems to work, because they'd be a lot simpler. (And if I'm right about the GNU utilities having poor Unicode support, why that is, and when it will be fixed. It's not like Unicode is new!) Of course I realize the problem may be that I'm not holding my mouth right when I try, but if so, it would be nice for grep at least to have better documentation on Unicode usage. Right now the documentation for grep -P says "This is highly experimental and grep -P may warn of unimplemented features." And it seems to have been that way roughly forever.

Upvotes: 2

Views: 425

Answers (2)

RARE Kpop Manifesto

Reputation: 2821

For what it's worth, this is my own Jamo filter using GNU grep. noJamo is an alias for:

ggrep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'

However, if you only care about the core Jamo set that maps to the 11,172 syllables, and don't mind using something other than grep, then this regex over the raw UTF-8 bytes should be extremely fast:

\341\204[\200-\222]|
\341\205[\241-\265]|
\341\206[\250-\277]|\341\207[\200-\202]

If you count the byte values covered by the octal ranges in each row, you get exactly 19 cho in row 1, 21 jung in row 2, and 27 jong in row 3 (28 if you count the empty final).
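As a sanity check on those byte ranges, here is a small Python sketch. Python is my choice for illustration and not part of the answer above, and the names GROUPS, utf8_octal and group_sizes are hypothetical. It encodes the first and last code point of each group to UTF-8, prints the bytes in backslash-octal form, and counts the code points per group.

```python
# Conjoining jamo that compose the 11,172 precomposed syllables.
GROUPS = {
    "cho": (0x1100, 0x1112),   # leading consonants
    "jung": (0x1161, 0x1175),  # vowels
    "jong": (0x11A8, 0x11C2),  # trailing consonants
}


def utf8_octal(cp):
    """Return the UTF-8 encoding of a code point as backslash-octal text."""
    return "".join("\\%03o" % b for b in chr(cp).encode("utf-8"))


def group_sizes():
    """Number of code points in each group (ranges are inclusive)."""
    return {name: hi - lo + 1 for name, (lo, hi) in GROUPS.items()}


for name, (lo, hi) in GROUPS.items():
    print(name, utf8_octal(lo), utf8_octal(hi), hi - lo + 1)
```

The trailing-consonant range holds 27 code points; the usual figure of 28 includes the "no final consonant" case, which is how 19 x 21 x 28 reaches 11,172.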

I did a quick benchmark with a synthetic 5.55 GiB .txt file, of which some 4.29 GiB came out the other end of the filter.

The regex's filtering throughput was some 1.55 GiB/sec on the input side, practically at the limit of my SSD I/O.

(time (pvE0 < jamotest000001.txt| 
       mawk2 'BEGIN{ FS=ORS }

  /\341(\204[\200-\222]|
        \205[\241-\265]|
        \206[\250-\277]|
        \207[\200-\202] )/' 

| pvE9 | xxh128sum))| ecp; 

 
      in0: 5.55GiB 0:00:03 [1.55GiB/s] [1.55GiB/s]
  [=================>] 100%            
     out9: 4.29GiB 0:00:03 [1.20GiB/s] [1.20GiB/s]
  [                      <=>          ]
( pvE 0.1 in0 < jamotest000001.txt | mawk2  | pvE 0.1 out9 | xxh128sum; ) 

 3.70s user 2.73s system 178% cpu 3.597 total

f4ef119214a3c39c7c560ad24491b96c  stdin

Upvotes: 0

ikegami

Reputation: 385887

Decode inputs, encode outputs. If the encoding in question is UTF-8, the command-line switch -CSD will come in useful.

perl -CSD -ne'print if !/\p{Block: Hangul_Jamo}/'
perl -CSD -ne'print if !/\p{Block: Jamo}/'
perl -CSD -ne'print if !/\p{Blk=Jamo}/'
perl -CSD -ne'print if !/\p{InJamo}/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}]/'
grep -vP '[\x{1100}-\x{11FF}]'

You might want to add the Hangul_Jamo_Extended_A, Hangul_Jamo_Extended_B and Hangul_Compatibility_Jamo blocks.

perl -CSD -ne'print if !/[\p{Block: Hangul_Jamo}\p{Block: Hangul_Jamo_Extended_A}\p{Block: Hangul_Jamo_Extended_B}\p{Block: Hangul_Compatibility_Jamo}]/'
perl -CSD -ne'print if !/[\p{Block: Jamo}\p{Block: JamoExtA}\p{Block: JamoExtB}\p{Block: CompatJamo}]/'
perl -CSD -ne'print if !/[\p{Blk=Jamo}\p{Blk=JamoExtA}\p{Blk=JamoExtB}\p{Blk=CompatJamo}]/'
perl -CSD -ne'print if !/[\p{InJamo}\p{InJamoExtA}\p{InJamoExtB}\p{InCompatJamo}]/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}\N{U+A960}-\N{U+A97F}\N{U+D7B0}-\N{U+D7FF}\N{U+3130}-\N{U+318F}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]/'
grep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
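For comparison, the same "decode inputs, encode outputs" idea can be sketched in Python, which the question mentions as a fallback. This is only an illustration under my own assumptions, not a replacement for the one-liners above: the names JAMO and filter_stream are hypothetical, and the input is assumed to be valid UTF-8. Note that in Python, unlike Perl, \u1100 inside a string literal really is a code point escape.

```python
import io
import re

# The four Jamo-related blocks listed above, as code point ranges.
JAMO = re.compile(
    "["
    "\u1100-\u11FF"  # Hangul Jamo
    "\uA960-\uA97F"  # Hangul Jamo Extended-A
    "\uD7B0-\uD7FF"  # Hangul Jamo Extended-B
    "\u3130-\u318F"  # Hangul Compatibility Jamo
    "]"
)


def filter_stream(src, dst):
    """Copy UTF-8 lines from binary stream src to dst, skipping Jamo lines."""
    for raw in src:
        line = raw.decode("utf-8")            # decode input
        if not JAMO.search(line):
            dst.write(line.encode("utf-8"))   # encode output


# Small demo on the test case from the question (U+AC04 is a Hangul Syllable).
sample = "foo\n\u1164\u11A8\n\uAC04\n".encode("utf-8")
out = io.BytesIO()
filter_stream(io.BytesIO(sample), out)
print(out.getvalue().decode("utf-8"), end="")
```

To use it as a command-line filter, one would call filter_stream(sys.stdin.buffer, sys.stdout.buffer) after importing sys, so that the decoding and encoding stay explicit rather than depending on the locale.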

Let's look at your failed attempts.

  • grep -Pv '[\x{1100}-\x{11FF}]'

    Actually, this one should work, and it does for me.

    $ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | od -t x1
    0000000 61 62 63 0a 64 e1 84 80 66 0a 67 68 69 0a
    0000016
    
    $ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | grep -Pv '[\x{1100}-\x{11FF}]'
    abc
    ghi
    
    $ grep --version | head -1
    grep (GNU grep) 2.16
    

    I do get your error on an older machine with grep (GNU grep) 2.10.

  • perl -ne'print unless /\p{Block: Hangul_Jamo}/'

    You didn't get any matches from /\p{Block: Hangul_Jamo}/ because you were matching against encoded text (UTF-8 bytes, chars in the range 00..FF) instead of decoded text (Unicode Code Points, chars in the range 00000..10FFFF).

  • perl -ne 'print unless /\p{InHangul_Jamo}/'

    \p{Block: X}, \p{Blk=X} and \p{InX} are equivalent.

  • perl -ne'print unless /[\x{1100}-\x{11FF}]/'

    [\x{1100}-\x{11FF}] is equivalent to \p{Block: Hangul_Jamo}.

  • perl -ne'print unless /[\u1100-\u11ff]/'

    You got too many matches since \u in double-quoted string literals and in regex pattern literals titlecases the next character. (e.g. "\uxyz" is equivalent to "Xyz".)

    As such, [\u1100-\u11ff] is equivalent to [01f].
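The two failure modes above can be reproduced in a short Python sketch. It is an illustration under my own assumptions (Python's re module standing in for Perl's regex engine, with hypothetical variable names), not a trace of what perl does internally. Undecoded input is modeled by mapping each UTF-8 byte to the character with the same value via latin-1, and the titlecased pattern is written out literally because Python treats \u1100 as a code point escape.

```python
import re

line = "d\u1100f\n"           # contains U+1100 HANGUL CHOSEONG KIYEOK
raw = line.encode("utf-8")    # bytes 64 e1 84 80 66 0a, as in the od output

jamo_text = re.compile("[\u1100-\u11FF]")

# Decoded text: the Jamo code point is a single character and matches.
decoded_hit = bool(jamo_text.search(line))

# Undecoded bytes viewed as characters 00..FF: none reaches U+1100,
# so the block never matches and every line is printed.
as_byte_chars = raw.decode("latin-1")
undecoded_hit = bool(jamo_text.search(as_byte_chars))

# After \u titlecases the digit, the class is the literal [1100-11ff],
# which contains only the characters 0, 1 and f.
titlecased = re.compile("[1100-11ff]")
hits = {s: bool(titlecased.search(s)) for s in ("foo", "abc", "\uAC04")}

print(decoded_hit, undecoded_hit, hits["foo"], hits["abc"], hits["\uAC04"])
```

This is why the ASCII line foo from the question was dropped (it contains an f), while a line without 0, 1 or f would have survived.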

Upvotes: 2
