Reputation: 155
I have a simple method for extracting hashtags (#hashTag) from text:
private String[] buildHashTag(String str) {
ArrayList<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("(#\\w+)\\b").matcher(str);
while (m.find()) {
allMatches.add(m.group());
}
return allMatches.toArray(new String[0]);
}
The problem occurs when I pass a string containing special characters, for example "POMERANČ".
Test input:
#Orange in Czech language mean #pomeranč :-)
OUTPUT:
[#Orange]
But this is wrong; the output must be [#Orange, #pomeranč]. Can you tell me what is wrong with the code? Thank you.
Upvotes: 3
Views: 474
Reputation: 627469
Add the Pattern.UNICODE_CHARACTER_CLASS modifier, or use the inline form Pattern.compile("(?U)(#\\w+)\\b"). Otherwise, \b and \w do not match all Unicode characters: by default \w covers only [a-zA-Z_0-9], which excludes č.
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.
Here is a demo:
String str = "#Orange in Czech language mean #pomeranč :-)";
ArrayList<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("(?U)(#\\w+)\\b").matcher(str);
// ^ the (?U) prefix in the pattern above enables UNICODE_CHARACTER_CLASS
while (m.find()) {
allMatches.add(m.group());
}
System.out.println(Arrays.toString(allMatches.toArray()));
Output: [#Orange, #pomeranč]
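The demo above uses the inline (?U) form. For completeness, here is a minimal sketch of the other option mentioned, passing Pattern.UNICODE_CHARACTER_CLASS as a flag to Pattern.compile. The class name HashtagFlagDemo and the method name extractHashtags are made up for illustration; the regex is the same as in the question, minus the unneeded capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original post.
class HashtagFlagDemo {
    // Compiled once. The flag makes \w and \b follow the Unicode definitions,
    // which is what the inline (?U) prefix does as well.
    private static final Pattern HASHTAG =
            Pattern.compile("#\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every hashtag found in the text, in order of appearance.
    static List<String> extractHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Same input as in the question; expected: [#Orange, #pomeranč]
        System.out.println(extractHashtags("#Orange in Czech language mean #pomeranč :-)"));
    }
}
```

Keeping the compiled Pattern in a static field avoids recompiling the regex on every call, which the original buildHashTag method does.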
Upvotes: 5
Reputation: 26677
Use a negated character class instead:
/#[^ ]+/
[^ ]+ is a negated character class: it matches anything other than a space, so in effect it matches characters up to the next space.
Upvotes: 1
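The /#[^ ]+/ notation above is a regex literal from other languages; in Java the pattern is written as a plain string. A rough sketch of how this approach might look, with the class name NegatedClassDemo and the method name extractHashtags invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original post.
class NegatedClassDemo {
    // '#' followed by one or more characters that are not a space.
    private static final Pattern HASHTAG = Pattern.compile("#[^ ]+");

    // Collects every match in order of appearance.
    static List<String> extractHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Same input as in the question; expected: [#Orange, #pomeranč]
        System.out.println(extractHashtags("#Orange in Czech language mean #pomeranč :-)"));
        // Punctuation is not a space, so it stays attached; expected: [#java,]
        System.out.println(extractHashtags("I like #java, really"));
    }
}
```

This handles the Czech example without any Unicode flag, but as the second call shows, trailing punctuation becomes part of the match, and tabs or newlines do not end a tag. Whether that trade-off is acceptable depends on the input.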