Reputation: 1483
I am still new to things like bash and perl and need some help with a task. I am in the process of preparing (adding and editing) a large Khmer Unicode corpus to be used with a patch for ICU Khmer word-breaking.
So far I have been unable to find a stable solution to automatically remove all English letters and punctuation (leaving just Khmer).
I was told that Perl might be the way to go, but I am not sure where to start (I'm not really a programmer).
I have used a bash script in the past, but the results were not perfect (I ended up having to check the list by hand and remove non-Khmer characters).
Here are some suggestions I've received in the past:
LC_ALL=POSIX sort khmerdict.txt | sed '/[[:punct:]]/d' > khmer-sorted.txt
That should remove the punctuation, but for some reason it removed a lot of lines in my file, so it was useless.
And this:
sed -e 's/[a-zA-Z]//g' -e 's// /g' -e 's/\t/ /g' -e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' -e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' -e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' -e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
That was another attempt at removing English letters, punctuation, and all Khmer numbers, but like I said, it didn't work with perfect accuracy.
Does anyone have an idea of a stable solution for this that would work well with Khmer Unicode? Maybe there is a way to remove everything using a range of Unicode characters (Khmer Unicode Mapping PDF)?
If you want to try something on the dictionary you can download a test version here: http://www.sbbic.org/Khmer-Unicode-Wordlist.zip
And here is a short list to play around with:
កំណត់
--
ស្រូវ
ទម្លាប់
}
é
"សំយុង
"លើក"
"ព"
"ផ"
ទស្សន--
–សម្ភាស
ចម្ងាយahead
ទាត់១
Thanks, Nathan
Upvotes: 1
Views: 728
Reputation: 39158
perl -CS -Mutf8 -lpe's/[^ក-៝៰-៹]//g' < mixed.UTF-8.txt > khmer-only-no-digits.UTF-8.txt
It's a negated character class: every character outside the listed Khmer ranges is deleted, which also drops the Khmer digits.
Upvotes: 1
Reputation: 224839
Some versions of sed might support non-ASCII, multibyte encodings, but I would just use Perl, where the Unicode support is probably more reliable (and even more readable: you can use block names and refer to special characters by code point without having to type them literally).
Keep CR, LF, ZERO WIDTH NON-JOINER, and all characters from the Khmer and Khmer Symbols blocks:
perl -CIO -pe '
s/[^\r\n\x{200C}\p{Khmer}\p{KhmerSymbols}]+//g; # characters to keep
' <input >output
Same as above but also stripping Khmer digits (U+17E0–U+17E9):
perl -CIO -pe '
s/[^\r\n\x{200C}\p{Khmer}\p{KhmerSymbols}]+//g; # characters to keep
s/[\x{17E0}-\x{17E9}]+//g; # more characters to drop
' <input >output
I tested with Perl 5.8.9, Perl 5.10.0 and Perl 5.12.1.
Remove \p{KhmerSymbols} if you do not want to keep the characters from the Khmer Symbols block.
The input should be UTF-8 (your zipped test file was). The output will be UTF-8.
Here are some line statistics for your Khmer-Unicode-Wordlist.txt (CRLF line breaks):
Remove \x{200C} from the above programs if you do not want to keep the ZWNJ characters.
Upvotes: 5