Reputation: 2355
Unicode defines 25 white space characters. In the code below, Character.isWhitespace(char) reports that four of those 25 are not considered white space in Java. Why?
public class Main {
    public static void main(String... args) {
        char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char space : whiteSpaces) {
            // Not every Unicode white space character counts as white space in Java.
            System.out.println("[" + space + "] is a white space in Java: " + Character.isWhitespace(space));
        }
    }
}
Reference: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Character.html#isWhitespace(char)
Upvotes: 3
Views: 2325
Reputation: 719259
Why? Because that is how the method is specified. The javadoc for isWhitespace lists the characters that it matches. The four that you identified are not in that list.
We can't tell you why it was defined that way. However, one implication of what the javadoc says is that '\u00A0', '\u2007' and '\u202F' are excluded because they are non-breaking whitespace characters.
'\u0085' (NEL, "next line") is an interesting case. According to the Unicode code tables it is NOT a member of the general categories SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR. (It shows up in the CONTROL category.)
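The category claims above can be checked with Character.getType. The following is a minimal sketch, not part of the original answer; the class name CategoryCheck and the helper categoryName are hypothetical names chosen for illustration.

```java
final class CategoryCheck {
    // Hypothetical helper: turns the int returned by Character.getType
    // into a readable name for the categories discussed here.
    static String categoryName(char c) {
        switch (Character.getType(c)) {
            case Character.SPACE_SEPARATOR:     return "SPACE_SEPARATOR";
            case Character.LINE_SEPARATOR:      return "LINE_SEPARATOR";
            case Character.PARAGRAPH_SEPARATOR: return "PARAGRAPH_SEPARATOR";
            case Character.CONTROL:             return "CONTROL";
            default:                            return "OTHER";
        }
    }

    public static void main(String... args) {
        // The four characters from the question: NEL, NBSP, FIGURE SPACE, NARROW NBSP.
        char[] candidates = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : candidates) {
            System.out.println(String.format("U+%04X", (int) c) + " -> " + categoryName(c));
        }
    }
}
```

Running this should report CONTROL for U+0085 and SPACE_SEPARATOR for the other three, which matches the explanation above.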
If you want a method that recognises all Unicode space characters (i.e. characters in SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR), you should use isSpaceChar instead of isWhitespace.
Note that the Unicode spec is not a constant thing. The categorization of the codes, and indeed the definition of "white space", has evolved over time. Each Java version implements the specific version of the Unicode spec that was current when it was released. The details are in the javadoc for the Character class for each Java version. Note that a given Java release is NOT patched to track subsequent Unicode releases.
The bottom line is that "white space" is a rather slippery concept. If you want a method that implements a specific meaning, you may need to implement it yourself.
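As an illustration of "implement it yourself", here is a minimal sketch that hard-codes the 25 code points which, to my understanding, carry the Unicode White_Space property in recent Unicode versions. The class and method names are hypothetical, and the list is an assumption that should be verified against the Unicode version you actually target.

```java
final class UnicodeWhiteSpace {
    // Assumed list of White_Space code points (25 in total):
    // U+0009..U+000D, U+0020, U+0085, U+00A0, U+1680, U+2000..U+200A,
    // U+2028, U+2029, U+202F, U+205F, U+3000.
    static boolean isUnicodeWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D)
            || cp == 0x0020 || cp == 0x0085 || cp == 0x00A0 || cp == 0x1680
            || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028 || cp == 0x2029
            || cp == 0x202F || cp == 0x205F || cp == 0x3000;
    }

    // Counts how many code points the method accepts, as a sanity check on the list.
    static int countMatches() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isUnicodeWhiteSpace(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String... args) {
        System.out.println("Code points accepted: " + countMatches());
        System.out.println("NEL accepted: " + isUnicodeWhiteSpace(0x0085));
    }
}
```

Because the list is fixed in code, its behaviour does not change with the Java release, which is the point of writing it yourself. The trade-off is that it has to be updated by hand if the Unicode definition changes.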
Upvotes: 4
Reputation: 159135
If you read the documentation, i.e. the javadoc of Character.isWhitespace(char), it says:
Determines if the specified character is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:
- It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
- It is '\t', U+0009 HORIZONTAL TABULATION.
- It is '\n', U+000A LINE FEED.
- It is '\u000B', U+000B VERTICAL TABULATION.
- It is '\f', U+000C FORM FEED.
- It is '\r', U+000D CARRIAGE RETURN.
- It is '\u001C', U+001C FILE SEPARATOR.
- It is '\u001D', U+001D GROUP SEPARATOR.
- It is '\u001E', U+001E RECORD SEPARATOR.
- It is '\u001F', U+001F UNIT SEPARATOR.
Three of the four you listed are explicitly excluded because they are non-breaking spaces.
As for U+0085 NEXT LINE (NEL), it is not a Unicode space character, and it is not considered a whitespace character by Java, as you can see in that javadoc.
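The javadoc criteria quoted above can be restated as code. This is only a sketch of the documented rule, not the JDK's actual implementation; the class name JavaWhitespaceRule and the method matchesJavadocRule are hypothetical.

```java
final class JavaWhitespaceRule {
    // Restates the javadoc rule: a Unicode space character that is not one of
    // the three non-breaking spaces, or one of the listed control characters.
    static boolean matchesJavadocRule(char c) {
        boolean nonBreaking = c == '\u00A0' || c == '\u2007' || c == '\u202F';
        if (Character.isSpaceChar(c) && !nonBreaking) {
            return true;
        }
        // U+0009..U+000D (tab, line feed, vertical tab, form feed, carriage return)
        // and U+001C..U+001F (file, group, record and unit separators).
        return (c >= '\t' && c <= '\r') || (c >= '\u001C' && c <= '\u001F');
    }

    public static void main(String... args) {
        // Compare the sketch with the real method for the four characters in question.
        char[] candidates = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : candidates) {
            System.out.println(String.format("U+%04X", (int) c)
                    + " sketch=" + matchesJavadocRule(c)
                    + " jdk=" + Character.isWhitespace(c));
        }
    }
}
```

U+0085 fails both branches: it is a control character, so isSpaceChar rejects it, and it is not in either of the listed control ranges.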
Upvotes: 3
Reputation: 3349
Java doesn't seem to expose the Unicode whitespace list anywhere in the Character class.
In Java, isWhitespace is specifically defined by the list of criteria in its javadoc.
Java also makes Unicode spaces available, but not Unicode whitespace, via Character.isSpaceChar(). This is a slightly different list.
char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
for (char space : whiteSpaces) {
    // Not every Unicode white space character counts as white space in Java.
    System.out.println("[" + space + "] is a white space in Java: " + Character.isWhitespace(space) + " Unicode: " + Character.isSpaceChar(space));
}
Output:
[] is a white space in Java: false Unicode: false
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true
If it's important for your application to match the Unicode spec instead of the Java spec, just define it yourself.
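One possible way to define it, offered as a sketch rather than a recommendation: java.util.regex.Pattern documents a White_Space binary property, written \p{IsWhite_Space}, so a small helper can delegate to it. The class name RegexWhiteSpace and its method are hypothetical, and the exact set matched depends on the Unicode version bundled with the Java release in use.

```java
import java.util.regex.Pattern;

final class RegexWhiteSpace {
    // Matches a single character that has the Unicode White_Space property,
    // as interpreted by java.util.regex in the running Java release.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isUnicodeWhiteSpace(char c) {
        return WHITE_SPACE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String... args) {
        // Same four characters as in the snippet above.
        char[] candidates = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char c : candidates) {
            System.out.println(String.format("U+%04X", (int) c)
                    + " White_Space: " + isUnicodeWhiteSpace(c));
        }
    }
}
```

Unlike isWhitespace and isSpaceChar, this check is expected to accept all four characters, including U+0085.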
Upvotes: 1