cagounai

Reputation: 155

Java regex for special character

I have a simple method for extracting #hashTags from text:

private String[] buildHashTag(String str) {
    ArrayList<String> allMatches = new ArrayList<String>();
    Matcher m = Pattern.compile("(#\\w+)\\b").matcher(str);
    while (m.find()) {
        allMatches.add(m.group());
    }
    return allMatches.toArray(new String[0]);
}

The problem occurs when I pass a string containing a special character, for example "POMERANČ".

Test: INPUT:

#Orange in Czech language mean #pomeranč :-)

OUTPUT:

[#Orange]

But this is wrong: the output should be [#Orange, #pomeranč]. Can you tell me what is wrong with the code? Thank you.

Upvotes: 3

Views: 474

Answers (2)

Wiktor Stribiżew

Reputation: 627469

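Add the Pattern.UNICODE_CHARACTER_CLASS flag, or use its embedded form: Pattern.compile("(?U)(#\\w+)\\b"). Otherwise, \b and \w do not match all Unicode characters: by default \w covers only the ASCII set [a-zA-Z_0-9], so č is not treated as a word character.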

The Javadoc for this flag states:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

Here is a demo:

String str = "#Orange in Czech language mean #pomeranč :-)";
ArrayList<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("(?U)(#\\w+)\\b").matcher(str);
//                           ^^^^
while (m.find()) {
    allMatches.add(m.group());
}
System.out.println(Arrays.toString(allMatches.toArray()));

Output: [#Orange, #pomeranč]
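For completeness, here is a rough sketch of the other option mentioned above: passing Pattern.UNICODE_CHARACTER_CLASS as a flag argument instead of the inline (?U) prefix. The class name HashTagFlagDemo and the method name extractHashTags are made up for illustration; only the pattern and the flag come from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original question.
public class HashTagFlagDemo {
    // Same pattern as above, with the flag passed as a constant
    // rather than embedded as (?U). Compiled once and reused.
    private static final Pattern HASH_TAG =
            Pattern.compile("(#\\w+)\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every hashtag match in the order it appears in the text.
    public static List<String> extractHashTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASH_TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Prints [#Orange, #pomeranč]
        System.out.println(extractHashTags("#Orange in Czech language mean #pomeranč :-)"));
    }
}
```

Keeping the compiled Pattern in a static field is just a design choice here, so the regex is not recompiled on every call.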

Upvotes: 5

nu11p01n73R

Reputation: 26677

Use a negated character class instead:

/#[^ ]+/
  • [^ ]+ is a negated character class: it matches any character other than a space, so in effect the match runs up to the next space
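In Java this could look roughly like the sketch below. The class name HashTagNegatedClassDemo and the method name extractHashTags are assumptions for illustration, not part of the question. Note that, as the second call shows, this pattern also keeps any punctuation attached to the tag, because only a space ends the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original question.
public class HashTagNegatedClassDemo {
    // '#' followed by one or more characters that are not a space.
    // No \w is involved, so non-ASCII letters need no special flag.
    private static final Pattern HASH_TAG = Pattern.compile("#[^ ]+");

    // Collects every hashtag match in the order it appears in the text.
    public static List<String> extractHashTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASH_TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Prints [#Orange, #pomeranč]
        System.out.println(extractHashTags("#Orange in Czech language mean #pomeranč :-)"));
        // Prints [#java,, #regex!] because trailing punctuation is not a space
        System.out.println(extractHashTags("#java, #regex!"));
    }
}
```

If tags may be followed by tabs or newlines as well, #\\S+ is a possible variant, since \S excludes all whitespace rather than only the space character.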


Upvotes: 1
