MasterJoe

Reputation: 2355

Why does Java not recognize these white spaces?

There are 25 types of white space characters. The code below shows that Character.isWhitespace(char) does not consider four of the 25 to be white space in Java. Why?

public class Main {
    public static void main(String...args){
        char [] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for(char space : whiteSpaces){
            // Not all spaces are white spaces in Java.
            System.out.println("[" + space + "] is a white space in Java:" + Character.isWhitespace(space));
        }
    }
}

Reference: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Character.html#isWhitespace(char)

Upvotes: 3

Views: 2325

Answers (3)

Stephen C

Reputation: 719259

Why? Because that is how the method is specified. The javadoc for isWhitespace lists the characters that it matches. The four that you identified are not in the list.

We can't tell you why it was defined that way. However, one implication of what the javadoc says is that '\u00A0', '\u2007' and '\u202F' are excluded because they are non-breaking whitespace characters.

'\u0085' or NEL is an interesting case. According to the Unicode code tables (see here for an unofficial summary) it is NOT a member of the general categories SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR. (It shows up in the CONTROL category.)
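The general categories mentioned above can be checked directly with Character.getType. The following is a minimal sketch; the class name CategoryCheck and the helper categoryName are hypothetical names used only for illustration.

```java
class CategoryCheck {
    // Hypothetical helper: maps the general category of a code point
    // to a readable name for the categories discussed here.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:     return "SPACE_SEPARATOR";
            case Character.LINE_SEPARATOR:      return "LINE_SEPARATOR";
            case Character.PARAGRAPH_SEPARATOR: return "PARAGRAPH_SEPARATOR";
            case Character.CONTROL:             return "CONTROL";
            default:                            return "OTHER";
        }
    }

    public static void main(String[] args) {
        // The four characters from the question.
        int[] codePoints = {0x0085, 0x00A0, 0x2007, 0x202F};
        for (int cp : codePoints) {
            System.out.printf("U+%04X -> %s%n", cp, categoryName(cp));
        }
    }
}
```

U+0085 is reported as CONTROL, while the three non-breaking spaces are reported as SPACE_SEPARATOR.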

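If you want a method that recognises every character in the Unicode general categories SPACE_SEPARATOR, LINE_SEPARATOR or PARAGRAPH_SEPARATOR, including the non-breaking spaces, you should use Character.isSpaceChar instead of isWhitespace. Note that isSpaceChar still returns false for '\u0085', because NEL is in none of those categories.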

Note that the Unicode spec is not fixed. The categorization of the characters, and indeed the definition of "white space", has evolved over time. Each Java version implements a specific version of the Unicode spec that was current at the time it was released. For example:

  • Java 8 implements Unicode 6.2
  • Java 11 implements Unicode 10.0.0
  • Java 13 implements Unicode 12.1

The details are in the javadoc for the Character class for each Java version. Note that a given Java release is NOT patched to track subsequent Unicode releases.


The bottom line is that "white space" is a rather slippery concept. If you want a method that implements a specific meaning, you may need to implement it yourself.
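As a rough illustration of "implement it yourself", here is a minimal sketch that hard-codes 25 code points. The list is an assumption based on the White_Space property in recent Unicode versions, so verify it against the Unicode version you care about. The names UnicodeWhiteSpace and isUnicodeWhiteSpace are hypothetical.

```java
class UnicodeWhiteSpace {
    // Assumption: this is the White_Space list from recent Unicode versions
    // (25 code points). Older versions differ slightly, so check your target.
    static boolean isUnicodeWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D)   // TAB, LF, VT, FF, CR
            || cp == 0x0020                     // SPACE
            || cp == 0x0085                     // NEXT LINE (NEL)
            || cp == 0x00A0                     // NO-BREAK SPACE
            || cp == 0x1680                     // OGHAM SPACE MARK
            || (cp >= 0x2000 && cp <= 0x200A)   // EN QUAD .. HAIR SPACE
            || cp == 0x2028                     // LINE SEPARATOR
            || cp == 0x2029                     // PARAGRAPH SEPARATOR
            || cp == 0x202F                     // NARROW NO-BREAK SPACE
            || cp == 0x205F                     // MEDIUM MATHEMATICAL SPACE
            || cp == 0x3000;                    // IDEOGRAPHIC SPACE
    }

    public static void main(String[] args) {
        // The four characters from the question.
        int[] codePoints = {0x0085, 0x00A0, 0x2007, 0x202F};
        for (int cp : codePoints) {
            System.out.printf("U+%04X isWhitespace=%b custom=%b%n",
                    cp, Character.isWhitespace(cp), isUnicodeWhiteSpace(cp));
        }
    }
}
```

The code points are written as hexadecimal int literals rather than char literals, which avoids the problem that escapes such as \u000A are translated before the Java source is parsed.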

Upvotes: 4

Andreas

Reputation: 159135

If you read the documentation, i.e. the javadoc of Character.isWhitespace(char), it says:

Determines if the specified character is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:

  • It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
  • It is '\t', U+0009 HORIZONTAL TABULATION.
  • It is '\n', U+000A LINE FEED.
  • It is '\u000B', U+000B VERTICAL TABULATION.
  • It is '\f', U+000C FORM FEED.
  • It is '\r', U+000D CARRIAGE RETURN.
  • It is '\u001C', U+001C FILE SEPARATOR.
  • It is '\u001D', U+001D GROUP SEPARATOR.
  • It is '\u001E', U+001E RECORD SEPARATOR.
  • It is '\u001F', U+001F UNIT SEPARATOR.

Three of the four you listed are explicitly excluded because they are non-breaking spaces.

As for U+0085 NEXT LINE (NEL), it is not a Unicode space character, and it is not considered a whitespace character by Java, as you can well see in that javadoc.

Upvotes: 3

Daniel Centore

Reputation: 3349

Java doesn't seem to expose the Unicode white space list anywhere in the Character class.

In Java, isWhitespace is specifically defined as one of these:

  • It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
  • It is '\t', U+0009 HORIZONTAL TABULATION.
  • It is '\n', U+000A LINE FEED.
  • It is '\u000B', U+000B VERTICAL TABULATION.
  • It is '\f', U+000C FORM FEED.
  • It is '\r', U+000D CARRIAGE RETURN.
  • It is '\u001C', U+001C FILE SEPARATOR.
  • It is '\u001D', U+001D GROUP SEPARATOR.
  • It is '\u001E', U+001E RECORD SEPARATOR.
  • It is '\u001F', U+001F UNIT SEPARATOR.

Java also exposes the Unicode space characters (but not the Unicode white space list) via Character.isSpaceChar(). That method matches a slightly different set of characters.

char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
for (char space : whiteSpaces) {
    // Not all spaces are white spaces in Java.
    System.out.println("[" + space + "] is a white space in Java: " + Character.isWhitespace(space) + " Unicode: " + Character.isSpaceChar(space));
}

Output:

[] is a white space in Java: false Unicode: false
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true
[ ] is a white space in Java: false Unicode: true

If it's important for your application to match the Unicode spec instead of the Java spec, just define the check yourself.
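One possible shortcut, offered as a sketch rather than a definitive answer: java.util.regex.Pattern documents a White_Space binary property (written \p{IsWhite_Space}), which may be close enough to the Unicode definition for many uses. The class and method names below are hypothetical, and the exact set matched depends on the Unicode version your Java release implements.

```java
import java.util.regex.Pattern;

class WhiteSpaceProperty {
    // Binary property syntax from the Pattern javadoc: \p{Is<Property>}.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // Hypothetical helper: true if the single char matches the property.
    static boolean isUnicodeWhiteSpace(char c) {
        return WHITE_SPACE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // The four characters from the question.
        char[] whiteSpaces = {'\u0085', '\u00A0', '\u2007', '\u202F'};
        for (char space : whiteSpaces) {
            System.out.printf("U+%04X Java: %b regex White_Space: %b%n",
                    (int) space, Character.isWhitespace(space), isUnicodeWhiteSpace(space));
        }
    }
}
```

This keeps the character list out of your own code, at the cost of going through the regex engine for every check.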

Upvotes: 1
