Reputation: 33
I'm using Java regex on Android and I'm seeing strange differences, such as the following:
Java: "COSÌ".replaceAll( "\\W", "" ) ----> "COS"
Android: "COSÌ".replaceAll( "\\W", "" ) ----> "COSÌ"
Has anyone noticed similar differences between the Java and Android String classes?
Upvotes: 3
Views: 644
Reputation: 56809
Straight from the Android documentation, right after the list of short-hand character classes (\d, \w, \s, etc.):
Note that these built-in classes don't just cover the traditional ASCII range. For example, \w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
This also explains why Ì is not replaced when the same code runs on Android.
While it is correct that the short-hand character classes also match Unicode characters, the sample definition of \w in the Android documentation is badly outdated. See the Appendix for more details.
In contrast, in Java SE, by default, \w is equivalent to [a-zA-Z_0-9].
\w only matches Unicode word characters when the Pattern.UNICODE_CHARACTER_CLASS flag is specified. When the flag is specified:
Initially, \w has the same definition as [\p{IsAlphabetic}\p{M}\p{Nd}\p{Pc}]
Later, \w is updated to [\p{IsAlphabetic}\p{M}\p{Nd}\p{Pc}\u200c\u200d]
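The Java SE behavior described above can be checked with a small sketch. The class and method names here (WordClassDemo, stripDefault, stripUnicode) are made up for illustration, and the results noted in the comments apply to Java SE only, not to Android:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative and not from the answer.
final class WordClassDemo {

    // Default Java SE behavior: \W is the complement of [a-zA-Z_0-9].
    static String stripDefault(String input) {
        return input.replaceAll("\\W", "");
    }

    // With UNICODE_CHARACTER_CLASS, \w also covers Unicode word characters,
    // so \W no longer matches an accented letter.
    static String stripUnicode(String input) {
        return Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(input)
                .replaceAll("");
    }

    public static void main(String[] args) {
        String word = "COS\u00CC"; // the word from the question, with a precomposed I-grave
        System.out.println(stripDefault(word)); // the accented letter is removed on Java SE
        System.out.println(stripUnicode(word)); // the accented letter is kept on Java SE
    }
}
```

On Java SE, the second call mirrors what the question reports for Android.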
Specify the character class directly. ICU regex doesn't support an ASCII-only version of the short-hand character class:
[^a-zA-Z0-9_]
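As a minimal sketch of that workaround, the explicit class can be precompiled and reused. The class name AsciiWordFilter and its method are hypothetical. Because the pattern does not rely on \w or \W at all, it should not depend on which definition of \w the runtime uses:

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name is illustrative only.
final class AsciiWordFilter {

    // Explicit negated class: anything that is not an ASCII letter, digit, or underscore.
    private static final Pattern NON_ASCII_WORD = Pattern.compile("[^a-zA-Z0-9_]");

    // Removes every character outside [a-zA-Z0-9_].
    static String keepAsciiWordChars(String input) {
        return NON_ASCII_WORD.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(keepAsciiWordChars("COS\u00CC")); // prints "COS"
    }
}
```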
\w
in ICUHere is the how the \w
has evolved over time:
The short-hand character class \w was defined as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}] (as shown in the documentation) up to ICU 3.0.
From ICU 3.2 (released on 2006/02/24) and up to and including ICU 4.8.1.1, [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}] (equivalent to [\p{Alphabetic}\p{M}\p{Nd}\p{Pc}] in the source code) is used instead. Changed at revision 16634.
From ICU 49 (released on 2012/06/06), the current definition in the documentation is used: [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d] (equivalent to [\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\u200c\u200d] in the source code). Changed at revision 31278.
The string above is used to construct URX_ISWORD_SET, which is used in regcmp.cpp in doBackslashW to compile the regex.
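To see which of these definitions a given runtime appears to follow, one option is to probe \w with a few code points that the definitions treat differently. This is only a sketch: the class WordClassProbe and its method are hypothetical, and the results depend on the regex engine and its version:

```java
import java.util.regex.Pattern;

// Hypothetical probe; the names are illustrative only.
final class WordClassProbe {

    // Returns true if the single code point matches \w under the given Pattern flags.
    static boolean isWordChar(int codePoint, int flags) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+005F (underscore) is Connector_Punctuation: absent from the oldest ICU definition.
        // U+0301 (combining acute accent) is a Mark: also absent from the oldest ICU definition.
        // U+200C (zero width non-joiner) is only listed in the newest definition.
        int[] probes = {0x00CC, 0x005F, 0x0301, 0x200C};
        for (int cp : probes) {
            System.out.printf("U+%04X matches \\w: %b%n", cp, isWordChar(cp, 0));
        }
    }
}
```

On Java SE, passing Pattern.UNICODE_CHARACTER_CLASS as the flags argument switches \w to the Unicode definition. On Android the flags would normally be left at 0.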
Even at android-1.6_r1 (Donut), when the Pattern class documentation was barren, Android was already using ICU 3.8. The source code shows that it uses the definition from the second bullet point.
The documentation probably falls back to describing the behavior of the oldest version of Android.
If you want to navigate around the source code of Android yourself:
libcore (Java Class Library)
  android-1.6_r1 up to android-2.2.3_r2.1: platform/dalvik repository. The Pattern class can be located at libcore/regex/src/main/java/java/util/regex/Pattern.java
  android-2.3_r1 to now: platform/libcore repository. The Pattern class can be located at /luni/src/main/java/java/util/regex/Pattern.java
icu4c (ICU library for C)
  android-1.6_r1 up to android-4.4.4_r2.0.1: platform/external/icu4c repository. Regex-related code can be found in i18n, Unicode-related code can be found in common.
  android-5.0.0_r1 to now: platform/external/icu. Enter icu4c/source, then follow a similar path as above.
Upvotes: 4
Reputation: 626926
Have a look at the Android regular expression syntax documentation:
Note that these built-in classes don't just cover the traditional ASCII range. For example, \w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. For more details see Unicode TR-18, and bear in mind that the set of characters in each class can vary between Unicode releases. If you actually want to match only ASCII characters, specify the explicit characters you want; if you mean 0-9 use [0-9] rather than \d, which would also include Gurmukhi digits and so forth.
Thus, use an explicit range to make sure you only match English letters, digits, and the underscore: replaceAll("[^a-zA-Z0-9_]", "").
Upvotes: 1