Illegal characters in URI

Question

The java.net.URI ctor accepts most non-ASCII characters but does not accept ideographic space (0x3000). The ctor fails with java.net.URISyntaxException: Illegal character in path ...

So my questions are:

Why does the URI ctor reject 0x3000 when it accepts other non-ASCII characters?
Which other characters does it reject?

Stephen C · Accepted Answer

Please note the 1st example contains the ideographic space rather than a regular space.

It is the ideographic space that is the problem.

Here is the code that allows non-ASCII characters to be used:

        } else if ((c > 128)
                   && !Character.isSpaceChar(c)
                   && !Character.isISOControl(c)) {
            // Allow unescaped but visible non-US-ASCII chars
            return p + 1;
        }

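As you can see, a non-ASCII character is only accepted if it is neither a space character nor an ISO control character. In other words, it disallows "funky" non-visible characters.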
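
To see how that condition classifies individual characters, here is a minimal sketch that copies the quoted test into a standalone method. The class and method names (NonAsciiCheck, allowedUnescaped) are made up for illustration and are not part of java.net.URI.

```java
public class NonAsciiCheck {
    // Hypothetical helper: same condition as the snippet quoted above.
    static boolean allowedUnescaped(char c) {
        return (c > 128)
                && !Character.isSpaceChar(c)
                && !Character.isISOControl(c);
    }

    public static void main(String[] args) {
        // Ideographic space (0x3000): a Unicode space character.
        System.out.println(allowedUnescaped((char) 0x3000)); // false
        // Latin small letter e with acute (0x00E9): a visible letter.
        System.out.println(allowedUnescaped((char) 0x00E9)); // true
        // No-break space (0x00A0): also a Unicode space character.
        System.out.println(allowedUnescaped((char) 0x00A0)); // false
        // Next line (0x0085): an ISO control character.
        System.out.println(allowedUnescaped((char) 0x0085)); // false
    }
}
```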

See also the URI class javadoc, which specifies which characters are allowed (by the class!) in each component of a URI.

Why?

It is probably a safety measure.

What others are disallowed?

Any character that is a space character or a control character, according to the respective Character predicate methods (Character.isSpaceChar and Character.isISOControl). See the Character javadoc for a precise specification.
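
A quick way to check this against the real constructor is to try a few strings and see which ones throw. This is only a sketch: the host example.com and the helper name parses are placeholders I picked, not anything from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriParseDemo {
    // Hypothetical helper: true if the URI constructor accepts the string.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Visible non-ASCII letter in the path: accepted.
        System.out.println(parses("http://example.com/caf\u00e9"));  // true
        // Ideographic space in the path: rejected.
        System.out.println(parses("http://example.com/a\u3000b"));   // false
        // No-break space is a space character too: rejected.
        System.out.println(parses("http://example.com/a\u00a0b"));   // false
        // A plain ASCII space is also illegal in a path.
        System.out.println(parses("http://example.com/a b"));        // false
    }
}
```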

You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed if you:

convert the UCS character code to UTF-8, and
percent encode the UTF-8 bytes as required by the spec.

My understanding is that the URI.toASCIIString() method will take care of that if you have a "deviant" java.net.URI object.
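
Both halves of this can be tried out in a few lines. The sketch below first percent-encodes the UTF-8 bytes of the ideographic space by hand, so that the URI class accepts it, and then calls toASCIIString() on a URI that contains an unescaped non-ASCII letter. The class name, the helper percentEncodeUtf8 and the host example.com are assumptions for the example. Note that the helper escapes every byte it is given, so it should only be applied to the characters that need escaping.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class UriEscapeDemo {
    // Hypothetical helper: percent-encodes every UTF-8 byte of the input.
    static String percentEncodeUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Step 1 and 2 by hand: UTF-8 bytes of 0x3000, percent-encoded.
        String escaped = percentEncodeUtf8("\u3000");
        System.out.println(escaped); // %E3%80%80

        // The escaped form parses, and getPath() decodes it again.
        URI u = URI.create("http://example.com/a" + escaped + "b");
        System.out.println(u.getRawPath());                  // /a%E3%80%80b
        System.out.println(u.getPath().equals("/a\u3000b")); // true

        // A "deviant" URI with an unescaped non-ASCII letter.
        URI deviant = URI.create("http://example.com/caf\u00e9");
        System.out.println(deviant.toASCIIString()); // http://example.com/caf%C3%A9
    }
}
```

URI.create is used here instead of the constructor only to avoid the checked URISyntaxException; it applies the same parsing rules.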
