xiaohan2012

Reputation: 10342

Use regular expression to match ANY Chinese character in utf-8 encoding

For example, I want to match a string consisting of m to n Chinese characters, then I can use:

[single Chinese character regular expression]{m,n}

Is there a regular expression for a single Chinese character that matches any Chinese character in existence?

Upvotes: 50

Views: 106692

Answers (7)

Dr. Alex RE

Reputation: 1708

Is there some regular expression of a single Chinese character, which could be any Chinese characters that exists?

Recommendation

To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, try RE/flex[1], a lexical analyzer generator that I developed (with contributions from others) to extend Flex++ with Unicode and other useful features.

For example, you can write Unicode patterns (UTF-8 regular expressions) in lexer specifications:

%option flex unicode
%%
[肖晗]   { printf ("xiaohan/2\n"); }
%%

%option unicode enables Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):

%option flex
%%
(?u:[肖晗])   { printf ("xiaohan/2\n"); }
(?u:\p{Han})  { printf ("Han character %s\n", yytext); }
.             { printf ("8-bit character %d\n", yytext[0]); }
%%

%option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls include wide char operations.

Background

In plain old Flex, I ended up defining UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers (the id pattern below):

digit           [0-9]
alpha           ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id              ({alpha})({alpha}|{digit})*            

The alpha pattern supports ASCII letters, underscore, and the Unicode code points used in identifiers (\p{L} etc.). To keep its size manageable, the pattern permits more Unicode code points than strictly necessary: it trades accuracy for compactness, and in some cases it accepts UTF-8 overlong sequences that are not valid UTF-8. If you are considering this approach, be wary of the problems and safety concerns.
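As an illustration (a Python analogy, not the Flex pattern itself), the core idea of the alpha pattern can be sketched at the byte level: match an ASCII letter or underscore, or a UTF-8 lead byte followed by any continuation bytes. As warned above, this is deliberately loose and also accepts overlong and invalid sequences:

```python
import re

# Byte-level sketch of the {alpha} idea: an ASCII letter/underscore, or a
# UTF-8 lead byte [\xC0-\xFF] followed by continuation bytes [\x80-\xBF].
# Loose by design: it also accepts overlong and invalid UTF-8 sequences.
alpha = re.compile(rb"[a-zA-Z_]|[\xC0-\xFF][\x80-\xBF]*")

tokens = alpha.findall("id_名x".encode("utf-8"))
print(tokens)  # the 3-byte UTF-8 sequence for 名 comes out as one token
```

Note how the multi-byte character stays in one piece only because the lead byte and its continuation bytes are matched together as a unit.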

Safety

When using UTF-8 directly in Flex patterns, there are several concerns:

  1. Encoding your own UTF-8 patterns in Flex for matching any Unicode character is prone to errors. Patterns should be restricted to characters in the valid Unicode range only. Unicode code points cover the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogate pairs; these are not valid code points. When using a tool to convert a Unicode range to UTF-8, make sure to exclude invalid code points.

  2. Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.

  3. Catching lexical input errors in your lexer requires a special . (dot) rule that matches both valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, so that you can produce an error message saying the input is rejected. If you use dot as a "catch-all-else" to produce an error message but your dot does not match invalid Unicode, your lexer will either hang ("scanner is jammed") or ECHO rubbish characters to the output via the Flex "default rule".

  4. Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).

  5. As you point out, patterns such as [unicode characters] do not work at all with Flex, because the UTF-8 characters in a bracket list are multi-byte: each single byte can be matched, but the UTF-8 character as a whole cannot.
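Point 5 can be demonstrated with a quick Python analogy: compiling a bracket list over UTF-8 bytes produces a class of single bytes, so a multi-byte character is never matched as a whole.

```python
import re

# A bracket list over UTF-8 *bytes* becomes a class of six single bytes,
# so only one byte of a 3-byte character ever matches.
byte_class = re.compile("[肖晗]".encode("utf-8"))
m = byte_class.search("晗".encode("utf-8"))
print(len(m.group()))  # 1 -- one byte matched, not the whole character

# Matching on decoded text treats each character as a single unit.
text_class = re.compile("[肖晗]")
print(text_class.search("hello 晗").group())  # 晗
```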

[1] https://github.com/Genivia/RE-flex

Upvotes: 2

Eli O.

Reputation: 2101

For most programming languages, a regular expression that matches 99.9%+ of Chinese characters is:

[\u4E00-\u9FFF]

Works with Python, modern JavaScript, Go, and Rust, but not PHP.

Useful if your language doesn't support notations like \p{Han}, \p{script=Han}, or \p{IsCJKUnifiedIdeographs} from the other answers.

NB: this range is the CJK Unified Ideographs block, which also covers characters used in Japanese (kanji), Korean (hanja), and Vietnamese (chữ Nôm).
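For example, in Python this range can be combined with the {m,n} quantifier from the question (a quick sketch; the sample string is made up):

```python
import re

# Match runs of 2 to 4 characters from the CJK Unified Ideographs block.
han_run = re.compile(r"[\u4E00-\u9FFF]{2,4}")
print(han_run.findall("hello 你好世界 and 汉字 here"))  # ['你好世界', '汉字']
```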

Upvotes: 4

BiaowuDuan

Reputation: 69

In Go, just like this:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    compile, err := regexp.Compile("\\p{Han}") // match any single Chinese character
    if err != nil {
        return
    }
    str := compile.FindString("hello 世界")
    fmt.Println(str) // output: 世
}

Upvotes: 0

Artem

Reputation: 2085

In C#

new Regex(@"\p{IsCJKUnifiedIdeographs}")

Here it is in the Microsoft docs

And here's more info from Wikipedia: CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters are also used in Vietnam's Nôm script (now obsolete).
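As a quick sanity check of the count quoted above, the size of the U+4E00 through U+9FEF range can be computed directly:

```python
# Number of code points from U+4E00 through U+9FEF, inclusive.
count = 0x9FEF - 0x4E00 + 1
print(count)  # 20976
```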

Upvotes: 6

DayDayHappy

Reputation: 1679

In Java,

\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}

Upvotes: 7

dripp

Reputation: 147

In Java 7 and up, the format should be: "\p{IsHan}"

Upvotes: 0

tchrist

Reputation: 80423

The regex to match a Chinese (well, CJK) character is

\p{script=Han}

which can be abbreviated to simply

\p{Han}

This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.

Upvotes: 53
