Nathan

Reputation: 1483

Perl Script to Remove All English from Large Unicode Text File

I am still new to things like bash and perl and need some help with a task. I am in the process of preparing (adding and editing) a large Khmer Unicode corpus to be used with a patch for ICU Khmer word-breaking.

So far I have been unable to find a stable solution to automatically remove all English letters and punctuation (leaving just Khmer).

I was told that Perl might be the way to go, but I am not sure where to start (I'm not really a programmer).

I have used a bash script in the past, but the results were not perfect (I ended up having to check the list by hand and remove non-Khmer characters).

Here are some suggestions I've received in the past:

LC_ALL=POSIX sort khmerdict.txt | sed '/[[:punct:]]/d' > khmer-sorted.txt

This was supposed to remove the punctuation, but for some reason it removed a lot of lines from my file, so it was useless.
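(Editorial note: the likely reason lines disappeared is that sed's `/…/d` address deletes every line that *contains* a match, while a substitution keeps the line and removes only the matched characters. A minimal ASCII illustration of the difference; the Khmer case still needs a Unicode-aware tool, as the answers show:)

```shell
# /[[:punct:]]/d deletes any line containing punctuation;
# s/[[:punct:]]//g keeps every line and strips only the punctuation itself.
printf 'a,b\nc\n' | sed '/[[:punct:]]/d'     # → c          (first line dropped entirely)
printf 'a,b\nc\n' | sed 's/[[:punct:]]//g'   # → ab, then c (comma removed, lines kept)
```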

And this:

sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' -e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' -e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' -e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' -e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt

This was another try at removing English letters and punctuation, as well as all the Khmer numerals, but as I said, it didn't work with perfect accuracy.

Does anyone have an idea of a stable solution for this that would work well with Khmer Unicode? Maybe there is a way to remove everything outside a given range of Unicode characters (per the Khmer Unicode Mapping PDF)?

If you want to try something on the dictionary you can download a test version here: http://www.sbbic.org/Khmer-Unicode-Wordlist.zip

And here is a short list to play around with:

កំណត់
--
ស្រូវ
ទម្លាប់
}
é
"សំយុង
"លើក"
"ព"
"ផ"
ទស្សន--
–សម្ភាស
ចម្ងាយahead
ទាត់១

Thanks, Nathan

Upvotes: 1

Views: 728

Answers (2)

daxim

Reputation: 39158

perl -CS -Mutf8 -lpe's/[^ក-៝៰-៹]//g' < mixed.UTF-8.txt > khmer-only-no-digits.UTF-8.txt

It's a negated character class.

Upvotes: 1

Chris Johnsen

Reputation: 224839

Some versions of sed might support non-ASCII, multibyte encodings, but I would just use Perl, where the Unicode support is probably more reliable (and even more readable: you can use block names and refer to special characters without having to type them literally).

Keep CR, LF, ZERO WIDTH NON-JOINER, and all characters from the Khmer and Khmer Symbols blocks:

perl -CIO -pe '
    s/[^\r\n\x{200C}\p{Khmer}\p{KhmerSymbols}]+//g;   # characters to keep
' <input >output

Same as above but also stripping Khmer digits (U+17E0–U+17E9):

perl -CIO -pe '
    s/[^\r\n\x{200C}\p{Khmer}\p{KhmerSymbols}]+//g;   # characters to keep
    s/[\x{17E0}-\x{17E9}]+//g;                        # more characters to drop
' <input >output

I tested with Perl 5.8.9, Perl 5.10.0 and Perl 5.12.1.

Remove \p{KhmerSymbols} if you do not want to keep the characters from the Khmer Symbols block.

The input should be UTF-8 (your zipped test file was). The output will be UTF-8.

Here are some line statistics for your Khmer-Unicode-Wordlist.txt (CRLF line breaks):

  • 28378 total lines (the last one is missing a CR+LF)
  • 28052 lines with only “Khmer characters” (those from the Khmer (U+1780–U+17FF) or Khmer Symbols (U+19E0–U+19FF) blocks)
  • 308 lines with mixed characters (“Khmer characters” and others)
  • 18 lines without any “Khmer characters”
  • 51 lines with ZERO WIDTH NON-JOINER (U+200C)
    All of these occurred in the middle of a sequence of Khmer/Khmer Symbol characters.
    They may or may not be important for your purposes.
    Remove \x{200C} from the above programs if you do not want to keep these ZWNJ.

Upvotes: 5
