OldCurmudgeon

Reputation: 65821

Where is the Â (C2) coming from?

For some reason, a piece of code replaces spaces with \u00A0, i.e. a non-breaking space. That code is then used to sanitize a URL (yes, I know that is very bad, in many ways). Strangely, when the result is displayed in my test JSP, a rogue Â appears. Why?

Sample JSP to demonstrate the issue.

<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>JSP Page</title>
    <%
      String[] parameters = request.getParameterValues("p");
      if (parameters == null || parameters.length == 0) {
        parameters = new String[]{""};
      }
    %>
  </head>
  <body>
    <h1>Hello World!</h1>
    <a href='index.jsp?p=<%="Hello\u00A0there"%>'>A Link</a>
    <p><%=parameters[0]%></p>
  </body>
</html>

Why is the parameter showing as HelloÂ there? Where is the C2 coming from?

Added

BTW: The hex of the parameter is 48 65 6c 6c 6f c2 a0 74 68 65 72 65, showing the c2 in situ.

Upvotes: 12

Views: 7916

Answers (2)

radoh

Reputation: 4819

To answer the actual question "Where is Â (C2) coming from?", you may find this article helpful:
Non-breaking space, 0x00A0 in UTF-16, is encoded as 0xC2A0 in UTF-8.

This table may help as well:

Examples of encoded Unicode characters (in hexadecimal notation)

Code point (hex)  UTF-8 Sequence
0001              01
007F              7F
0080              C2 80   <-- start of the two-byte range; nbsp (00A0) falls here as C2 A0
07FF              DF BF
0800              E0 A0 80
FFFF              EF BF BF
010000            F0 90 80 80
10FFFF            F4 8F BF BF
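
If you want to check this yourself, here is a minimal sketch that asks Java for the UTF-8 bytes of a string and prints them as hex. The class name NbspBytes and the helper utf8Hex are made up for illustration; only getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class NbspBytes {
    // Returns the UTF-8 bytes of s as space-separated lowercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as ffffffc2.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00A0"));            // c2 a0
        System.out.println(utf8Hex("Hello\u00A0there"));  // 48 65 6c 6c 6f c2 a0 74 68 65 72 65
    }
}
```

The second line of output matches the hex dump in the question, so the c2 is simply the first byte of the two-byte UTF-8 form of the non-breaking space.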

Upvotes: 5

Erwin Smout

Reputation: 18408

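A rogue Â appearing is most often an indication that something was encoded using UTF-8 and then decoded again using a "traditional" single-byte code page, e.g. ISO-8859-1 or Windows-1252, where the byte C2 maps to Â. (Other code pages, such as CP850, would show a different stray character for the same byte.)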
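
A minimal sketch of that round trip, assuming the wrong decoder is ISO-8859-1 (which charset your container actually uses for request parameters is something you would have to check). The class MojibakeDemo and its methods misdecode and repair are hypothetical names for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes s as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes the damage, but only if the wrong decoder really was ISO-8859-1:
    // that charset maps every byte to exactly one char, so no data was lost.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Hello\u00A0there";
        String garbled = misdecode(original);

        // The single nbsp has become two chars: U+00C2 (Â) followed by U+00A0.
        System.out.println(original.length()); // 11
        System.out.println(garbled.length());  // 12
        System.out.println(garbled.charAt(5) == '\u00C2'); // true
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The repair step is only a diagnostic aid here. The real fix is to make both sides agree on UTF-8 so the bytes are never misread in the first place.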

Upvotes: 10
