Reputation: 65821
For some reason a piece of code replaces spaces with \u00A0 - i.e. a non-breaking space. This code is then used to sanitize a URL (yes, I know that is very bad - in many ways). Strangely, when these are displayed in my test JSP, a rogue Â appears - why?
Sample JSP to demonstrate the issue.
<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Page</title>
        <%
            String[] parameters = request.getParameterValues("p");
            if (parameters == null || parameters.length == 0) {
                parameters = new String[]{""};
            }
        %>
    </head>
    <body>
        <h1>Hello World!</h1>
        <a href='index.jsp?p=<%="Hello\u00A0there"%>'>A Link</a>
        <p><%=parameters[0]%></p>
    </body>
</html>
Why is the parameter showing as HelloÂ there? Where is the c2 coming from?
Added
BTW: The hex of the parameter is 48 65 6c 6c 6f c2 a0 74 68 65 72 65, showing the c2 in situ.
Upvotes: 12
Views: 7916
Reputation: 4819
To answer the actual question, "Where is Â (C2) coming from?", you may find this article helpful.
Non-breaking space, 0x00A0 in UTF-16, is encoded as 0xC2A0 in UTF-8.
This table may help as well
Examples of encoded Unicode characters (in hexadecimal notation)
Unicode code point    UTF-8 sequence
0001                  01
007F                  7F
0080                  C2 80          <-- two-byte range starts here; nbsp (00A0 -> C2 A0) falls in it
07FF                  DF BF
0800                  E0 A0 80
FFFF                  EF BF BF
010000                F0 90 80 80
10FFFF                F4 8F BF BF
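If you want to check the table rows yourself, here is a minimal Java sketch that prints the UTF-8 bytes of a few code points, including the non-breaking space. The class name Utf8Hex and the helper utf8Hex are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question or answer.
final class Utf8Hex {

    /** Returns the UTF-8 bytes of a single code point as space-separated uppercase hex. */
    static String utf8Hex(int codePoint) {
        // Character.toChars yields one char, or a surrogate pair above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x007F, 0x0080, 0x00A0, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, utf8Hex(cp));
        }
        // The line for the non-breaking space reads: U+00A0 -> C2 A0
    }
}
```

The C2 in the question's hex dump is therefore just the first of the two bytes that UTF-8 uses for U+00A0.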
Upvotes: 5
Reputation: 18408
A rogue Â appearing is most often an indication that something was encoded using UTF-8 and then decoded again using a "traditional" single-byte character set, e.g. ISO-8859-1 or Windows-1252, both of which map the byte C2 to Â.
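A small Java sketch of that round trip, assuming the wrong decoder is ISO-8859-1 (the thread does not confirm which charset the container actually used). The class MojibakeDemo and its methods misdecode and repair are hypothetical names for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the ISO-8859-1 decoder is an assumption.
final class MojibakeDemo {

    /** Encodes as UTF-8, then decodes the same bytes as ISO-8859-1 (the suspected mistake). */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses misdecode: recovers the raw bytes and decodes them as UTF-8. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Hello\u00A0there";
        String garbled = misdecode(original);

        // C2 becomes U+00C2 (Â) and A0 stays a non-breaking space,
        // so the string grows from 11 to 12 characters.
        System.out.println(garbled.length());                // 12
        System.out.println(garbled.charAt(5) == '\u00C2');   // true
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair step only works because ISO-8859-1 maps every byte to a character, so nothing is lost. It is a diagnostic trick rather than a cure; the lasting fix is to make the encoder and decoder agree on UTF-8.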
Upvotes: 10