Reputation: 42050
The java.net.URI ctor accepts most non-ASCII characters but does not accept the ideographic space (0x3000). The ctor fails with java.net.URISyntaxException: Illegal character in path ...
So my questions are:
- Why does the URI ctor not accept 0x3000 but does accept other non-ASCII characters?
- What other characters are disallowed?
Upvotes: 0
Views: 6622
Reputation: 718798
Please note the 1st example contains the ideographic space rather than a regular space.
It is the ideographic space that is the problem.
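The behaviour can be reproduced with a small sketch like the one below. The helper name parses and the example.com URLs are made up for illustration; only the java.net.URI single-argument constructor comes from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdeographicSpaceDemo {
    // Hypothetical helper: true if the single-argument URI constructor accepts s.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs, chosen only to exercise the path parser.
        System.out.println(parses("http://example.com/caf\u00e9")); // visible non-ASCII letter
        System.out.println(parses("http://example.com/a\u3000b"));  // ideographic space, U+3000
        System.out.println(parses("http://example.com/a b"));       // plain ASCII space
    }
}
```

The first call should be accepted, while both the ideographic space and the plain ASCII space should be rejected.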
Here is the code that allows non-ASCII characters to be used:
} else if ((c > 128)
           && !Character.isSpaceChar(c)
           && !Character.isISOControl(c)) {
    // Allow unescaped but visible non-US-ASCII chars
    return p + 1;
}
As you can see, it disallows "funky" non-visible characters.
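To see which characters pass that test, the condition can be copied into a standalone predicate. This is only a sketch: isAllowedOther is a hypothetical name, not a JDK method, and it mirrors just the quoted branch rather than the whole parser.

```java
class OtherCategoryCheck {
    // Mirrors the quoted condition; an illustration, not the JDK's own method.
    static boolean isAllowedOther(char c) {
        return c > 128
                && !Character.isSpaceChar(c)
                && !Character.isISOControl(c);
    }

    public static void main(String[] args) {
        char[] samples = {'\u00e9', '\u3000', '\u00a0', '\u0085', '\u2028'};
        for (char c : samples) {
            // Print the code point in hex next to the verdict.
            System.out.printf("U+%04X -> %b%n", (int) c, isAllowedOther(c));
        }
    }
}
```

Of these samples, only the accented letter (U+00E9) should come out as allowed; the ideographic space, the no-break space, the control character U+0085 and the line separator U+2028 should all be rejected.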
See also the URI class javadocs, which specify which characters are allowed (by the class!) in each component of a URI.
Why?
It is probably a safety measure.
What others are disallowed?
Any character that is whitespace or a control character ... according to the respective Character predicate methods. (See the Character javadocs for a precise specification.)
You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed if you encode them as octets (typically in UTF-8) and then percent-encode those octets.
My understanding is that the URI.toASCIIString() method will take care of that if you have a "deviant" java.net.URI object.
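A short sketch of that conversion, assuming a made-up example.com URL and a hypothetical helper named toAscii; URI.create and toASCIIString are the real java.net.URI methods:

```java
import java.net.URI;

class ToAsciiDemo {
    // Hypothetical helper: parse s, then return its US-ASCII form.
    static String toAscii(String s) {
        return URI.create(s).toASCIIString();
    }

    public static void main(String[] args) {
        // The accented letter is kept as-is by toString() but
        // percent-encoded as UTF-8 octets by toASCIIString().
        URI uri = URI.create("http://example.com/caf\u00e9");
        System.out.println(uri);
        System.out.println(uri.toASCIIString());
    }
}
```

Here the second line printed should end in caf%C3%A9, the UTF-8 octets of the accented letter in percent-encoded form.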
Upvotes: 0
Reputation: 122364
The set of acceptable characters is spelled out in detail in the JavaDoc documentation for java.net.URI:
Character categories
RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
- alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
- digit The US-ASCII decimal digit characters, '0' through '9'
- alphanum All alpha and digit characters
- unreserved All alphanum characters together with those in the string "_-!.~'()*"
- punct The characters in the string ",;:$&+="
- reserved All punct characters together with those in the string "?/[]@"
- escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
- other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)
The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
In particular, "other" does not include space characters, which are defined (by Character.isSpaceChar) as those with Unicode general category types SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR, and according to the page you've linked to in the question, the ideographic space character is indeed one of these types (SPACE_SEPARATOR).
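If you would rather check this from code than from a character table, something like the following sketch works. The helper name isSpaceSeparator is invented here; Character.getType, Character.isSpaceChar and Character.SPACE_SEPARATOR are standard.

```java
class SpaceCategoryDemo {
    // Hypothetical helper: true if c has general category SPACE_SEPARATOR (Zs).
    static boolean isSpaceSeparator(char c) {
        return Character.getType(c) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        char ideographicSpace = '\u3000';
        // Both checks should agree that U+3000 is a space character.
        System.out.println(isSpaceSeparator(ideographicSpace));
        System.out.println(Character.isSpaceChar(ideographicSpace));
    }
}
```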
Upvotes: 5