Thanh Nguyen

Reputation: 53

Java regular expression with a UTF-8 string

I have 2 strings:

String 1 is read from a text file, opened with a BufferedReader using the "UTF-8" encoding:

Tân_Dậu 1921 – Kỉ_Mão 1999

String 2 is one I typed in myself:

Tân_Dậu 1921 - Kỉ_Mão 1999

and my pattern string:

[(]?([A-ZTĐẤ][a-záââậầấẹịỉìíợnọúùửỵýỷ]+[_][A-ZDĐẤ][a-záậãâậầấẹuịìíợọúùửỵýỷ]+)?[ ]?((\\d{4})|([?]))[ ]?[-][ ]?(([A-ZĐKẤ][a-záâỉoậầấẹịỉìíợọúùửỵýỷ]+[_][A-ZĐẤ][a-záãâậầấãẹịìíợọúùửỵýỷ]+))?[ ]?(\\d{4}|\\d{2}[)])[ ]?[)]?

I use:

Matcher m = p.matcher(test.trim());
while(m.find())
{
    System.out.println("-->"+m.group());
}

Here 'test' is string 1 or string 2, but only string 2 matches. What is the problem and how can I solve it? Thanks for your help.

Upvotes: 3

Views: 758

Answers (1)

npinti

Reputation: 52185

The problem is the dash: your two strings use different characters. String 1 (from the file) contains an en dash, – (U+2013), while string 2 (typed in) contains the ASCII hyphen-minus, - (U+002D), and [-] only matches the latter. Changing your expression to this:

[(]?([A-ZTĐẤ][a-záââậầấẹịỉìíợnọúùửỵýỷ]+[_][A-ZDĐẤ][a-záậãâậầấẹuịìíợọúùửỵýỷ]+)?[ ]?((\\d{4})|([?]))[ ]?[-–][ ]?(([A-ZĐKẤ][a-záâỉoậầấẹịỉìíợọúùửỵýỷ]+[_][A-ZĐẤ][a-záãâậầấãẹịìíợọúùửỵýỷ]+))?[ ]?(\\d{4}|\\d{2}[)])[ ]?[)]?

should do the trick (example available here).

Notice how [-] has been changed to [-–].
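To see the difference in isolation, here is a minimal sketch. It does not use the full pattern from the question: BOTH_DASHES and HYPHEN_ONLY are simplified stand-ins I made up for illustration, and DashDemo and separators are hypothetical names. The source is kept ASCII-only, so the Vietnamese letters and the en dash are written as Unicode escapes.

```java
import java.util.regex.Pattern;

public class DashDemo {
    // Simplified stand-in for the question's pattern (an assumption, not the
    // original regex): "Name_Name year", a dash, "Name_Name year".
    // The separator class accepts U+002D (hyphen-minus) and U+2013 (en dash).
    static final Pattern BOTH_DASHES = Pattern.compile(
        "[\\p{L}\\p{M}_]+ \\d{4} ?[-\\u2013] ?[\\p{L}\\p{M}_]+ \\d{4}");

    // Same stand-in, but with the original [-] separator only.
    static final Pattern HYPHEN_ONLY = Pattern.compile(
        "[\\p{L}\\p{M}_]+ \\d{4} ?[-] ?[\\p{L}\\p{M}_]+ \\d{4}");

    // Lists the code point of every character that is not a letter, digit,
    // underscore, or space. Handy for spotting look-alike punctuation.
    static String separators(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(cp -> !Character.isLetterOrDigit(cp) && cp != '_' && cp != ' ')
         .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The two strings from the question, diacritics written as escapes.
        String fromFile = "T\u00E2n_D\u1EADu 1921 \u2013 K\u1EC9_M\u00E3o 1999"; // en dash
        String typed    = "T\u00E2n_D\u1EADu 1921 - K\u1EC9_M\u00E3o 1999";      // hyphen-minus

        System.out.println(separators(fromFile));                     // U+2013
        System.out.println(separators(typed));                        // U+002D
        System.out.println(HYPHEN_ONLY.matcher(fromFile).find());     // false
        System.out.println(BOTH_DASHES.matcher(fromFile).find());     // true
        System.out.println(BOTH_DASHES.matcher(typed).find());        // true
    }
}
```

Printing the code points first is a quick way to confirm which dash a file actually contains. If the input may contain other dash variants as well, the Unicode category \p{Pd} (dash punctuation) could replace the explicit character class, though that is broader than what the answer above suggests.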

Upvotes: 3
