Reputation: 543
I have the following URL that I want to escape:
http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip
So far I haven't found how to encode this string so that it is both safe to store in HTML and encoded as a URL, e.g. '&' should be replaced with %26, a space should be replaced with %20, etc.
Java's URLEncoder will, for example, replace spaces with a '+' sign, which isn't what I'm looking for.
Upvotes: 2
Views: 6437
Reputation: 53563
"I haven't found so far how to encode this string to match both storing in an HTML and encoded as a URL"
That's because there isn't any, since those are two separate things.
Printing in HTML should generally be done by replacing only ', ", <, > and & with &#39;, &quot;, &lt;, &gt; and &amp;. Here are examples doing that: Recommended method for escaping HTML in Java, the most trivial and easiest to reason about being:
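public static String encodeToHTML(String str) {
    return str
        // '&' must be replaced first, otherwise the '&' of the entities below gets escaped again
        .replace("&", "&amp;")
        .replace("'", "&#39;")
        .replace("\"", "&quot;")
        .replace("<", "&lt;")
        .replace(">", "&gt;");
}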
Note that the page must declare a matching character set, and that the requirements are slightly different if you print the URL inside an attribute value, for example.
URL encoding leaves a much shorter list of characters unchanged. From the URLEncoder documentation:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte.
The recommended encoding scheme to use is UTF-8.
You'd get those with
String encoded = java.net.URLEncoder.encode(url, "UTF-8");
The above will give you HTML form encoding, which is close to URL percent-encoding but has a few notable differences, the most relevant being + vs %20. To fix that, you can do this on its output:
encoded = encoded.replace("+", "%20");
Note also that you don't want to URL-encode the whole http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, but only the last part of it, TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, and the individual path segments if they are not fixed.
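As a rough illustration of encoding only the last segment, here is a minimal sketch. The class name SegmentEncoder and the helpers encodeSegment and buildUrl are made up for this example, and it assumes the base part of the URL is fixed and already valid.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SegmentEncoder {

    // Hypothetical helper: form-encodes one path segment as UTF-8, then turns
    // the "+" that URLEncoder emits for spaces into "%20". A literal "+" in the
    // input has already become "%2B" at that point, so it is not affected.
    public static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: appends the encoded file name to a base URL that is
    // assumed to need no escaping itself.
    public static String buildUrl(String base, String fileName) {
        return base + "/" + encodeSegment(fileName);
    }

    public static void main(String[] args) {
        String base = "http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5";
        String name = "TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip";
        System.out.println(buildUrl(base, name));
    }
}
```

If the intermediate path segments are not fixed either, the same encodeSegment call could be applied to each of them before joining them with "/".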
If you need to generate the URL and then print it in HTML, first encode it as a URL, then apply HTML escaping to the result.
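The order of the two steps can be sketched roughly as follows. LinkBuilder, encodeSegment, escapeHtml and buildLink are illustrative names, not part of any library, and the sketch assumes the link is written into a double-quoted href attribute.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class LinkBuilder {

    // Step 1: percent-encode a single path segment ("+" becomes "%20").
    public static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Step 2: escape the five HTML-special characters, '&' first so that the
    // entities produced by the later replacements are not escaped twice.
    public static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("'", "&#39;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Builds an anchor tag: the URL is percent-encoded and then HTML-escaped,
    // while the visible link text is only HTML-escaped.
    public static String buildLink(String base, String fileName) {
        String url = base + "/" + encodeSegment(fileName);
        return "<a href=\"" + escapeHtml(url) + "\">" + escapeHtml(fileName) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(buildLink("http://BUCKET_ENDPOINT/PATH_1", "A & B.zip"));
    }
}
```

After percent-encoding, the file name itself no longer contains HTML-special characters, but the rest of the URL still can (for example an '&' in a query string), which is why the whole URL goes through escapeHtml.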
Upvotes: 3
Reputation: 543
Since I already know that the path part of the URL doesn't need special escaping, I decided to go with the solution proposed here and encode only the zip file name, which covers what I need in this case:
String urlEscaped = URLEncoder.encode(URL_TO_ESCAPE, "UTF-8")
    .replaceAll("\\+", "%20")  // '+' is a regex metacharacter, so it needs a double backslash
    .replaceAll("%21", "!")
    .replaceAll("%27", "'")
    .replaceAll("%28", "(")
    .replaceAll("%29", ")")
    .replaceAll("%7E", "~");
Upvotes: 0