Ohad Benita

Reputation: 543

Escaping a URL in Java

I have the following URL that I want to escape:

http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip

So far I haven't found a way to encode this string so that it is valid both when stored in HTML and when used as a URL, e.g. '&' should be replaced with %26, a space should be replaced with %20, etc.

Java's URLEncoder will, for example, replace the spaces with a '+' sign, which isn't what I'm looking for

Upvotes: 2

Views: 6437

Answers (2)

eis

Reputation: 53563

I haven't found so far how to encode this string to match both storing in an HTML and encoded as a URL

That's because there is no such single encoding: HTML escaping and URL encoding are two separate things.

Printing in HTML should generally be done by replacing only ', ", <, > and & with &apos;, &quot;, &lt;, &gt; and &amp;. Examples of doing that can be found in "Recommended method for escaping HTML in Java"; the most trivial one, and the easiest to reason about, is:

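public static String encodeToHTML(String str) {
    // '&' must be replaced first; otherwise the '&' in the entities
    // produced by the later replacements would be escaped a second time.
    return str
        .replace("&",  "&amp;")
        .replace("'",  "&apos;")
        .replace("\"", "&quot;")
        .replace("<",  "&lt;")
        .replace(">",  "&gt;");
}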

Note that your page needs to declare a matching character set, and be aware that if you print the URL in, for example, an attribute value, the requirements are a bit different.

Encoding as a URL allows a much shorter list of characters. From the URLEncoder documentation:

The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.

The special characters ".", "-", "*", and "_" remain the same.

The space character " " is converted into a plus sign "+".

All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte.

The recommended encoding scheme to use is UTF-8.

You'd get those with

String encoded = java.net.URLEncoder.encode(url, "UTF-8");

The above will give you HTML form encoding, which is close to URL encoding but has a few notable differences, the most relevant being + vs %20. To fix that, you can do this on its output:

encoded = encoded.replace("+", "%20");

Note also that you don't want to apply URL encoding to the whole of http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, but only to the last part of it, TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, and to the individual path segments if they are not fixed. Encoding the whole string would also escape the ':' and '/' characters that give the URL its structure.
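
As a rough sketch of encoding only the last segment, assuming the base path is fixed and already valid (the class and method names here are made up for illustration, not part of any library):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PathSegmentEncoder {

    // Percent-encodes one path segment. URLEncoder produces form encoding,
    // so the '+' it emits for spaces is rewritten to %20 afterwards.
    // A literal '+' in the input is already %2B at that point, so it is safe.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: appends the encoded file name to an untouched base.
    static String buildUrl(String base, String fileName) {
        return base + "/" + encodeSegment(fileName);
    }

    public static void main(String[] args) {
        // prints http://BUCKET_ENDPOINT/PATH_1/TEST%20NAME%20%26%20MORE.zip
        System.out.println(buildUrl("http://BUCKET_ENDPOINT/PATH_1", "TEST NAME & MORE.zip"));
    }
}
```

If the intermediate path segments are not fixed either, the same encodeSegment call could be applied to each of them before joining with '/'.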

If you need to both generate the URL and print it in HTML, first encode it as a URL, then apply HTML escaping to the result.
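
The order of the two steps could look roughly like this. It is only a sketch: LinkRenderer, escapeHtml and toAnchor are hypothetical names, and the numeric reference &#39; is used for the apostrophe because &apos; is not defined in HTML 4.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class LinkRenderer {

    // Step 1: percent-encode a single path segment (spaces become %20).
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Step 2: escape for HTML. '&' goes first so that the entities
    // produced by the later replacements are not escaped again.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    // Builds a link: the URL is URL-encoded and then HTML-escaped,
    // while the visible text is only HTML-escaped.
    static String toAnchor(String base, String fileName) {
        String url = base + "/" + encodeSegment(fileName);
        return "<a href=\"" + escapeHtml(url) + "\">" + escapeHtml(fileName) + "</a>";
    }

    public static void main(String[] args) {
        // prints <a href="http://BUCKET_ENDPOINT/PATH_1/A%20%26%20B.zip">A &amp; B.zip</a>
        System.out.println(toAnchor("http://BUCKET_ENDPOINT/PATH_1", "A & B.zip"));
    }
}
```

In this example the encoded URL contains nothing left for the HTML step to escape, but a URL with a query string such as ?a=1&b=2 would have its '&' turned into &amp; at that stage.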

Upvotes: 3

Ohad Benita

Reputation: 543

Since I already know that the path part of the URL doesn't need special escaping, I decided to go with the solution proposed here and encode only the zip file name, which is all that is needed in this case:

 // replaceAll takes a regex, so '+' needs a doubled backslash; '%' needs no escaping.
 String urlEscaped = URLEncoder.encode(URL_TO_ESCAPE, "UTF-8")
            .replaceAll("\\+", "%20")
            .replaceAll("%21", "!")
            .replaceAll("%27", "'")
            .replaceAll("%28", "(")
            .replaceAll("%29", ")")
            .replaceAll("%7E", "~");

Upvotes: 0
