Reputation: 13735

How to match cjk characters with sed?

I'd like to match CJK characters, but the regex [[:alpha:]]\+ does not work for them. Does anybody know how to match CJK characters?

$ echo '程 a b' | sed -e 's/\([[:alpha:]]\+\)/x\1/g'
程 xa xb

The desired output is x程 a b.

Upvotes: 1

Answers (2)

Reputation: 627082

With Perl, your solution will look like

perl -CSD -Mutf8 -pe 's/\p{Han}+/x$&/g' filename

Or, with older Perl versions before 5.20, use a capturing group:

perl -CSD -Mutf8 -pe 's/(\p{Han}+)/x$1/g' filename

To modify the file contents in place, add the -i option:

perl -i -CSD -Mutf8 -pe 's/(\p{Han}+)/x$1/g' filename

NOTES

\p{Han} matches a single Han (Chinese) character; \p{Han}+ matches a run of one or more such characters
$1 is the backreference to the text captured by (\p{Han}+); $& stands for the whole match
-Mutf8 lets Perl recognize UTF-8-encoded characters written directly in your Perl code
-CSD (equivalent to -CIOED) decodes input and re-encodes output as UTF-8, so it works for UTF-8-encoded data.

Upvotes: 0

Reputation: 22032

As @WiktorStribiżew suggests, it will be easier to use Perl.
If Perl is an option, please try the following:

echo "程 a b" | perl -CIO -pe 's/(\p{Script_Extensions=Han}+)/x$1/g'

Output:

x程 a b

Upvotes: 2