Reputation: 31233
I'm trying to create a form validation unit that, in addition to the "regular" tests, checks encoding as well.
According to this article, http://www.w3.org/International/questions/qa-forms-utf-8, the only allowed characters in the range 0-31 are CR, LF and TAB, and DEL (127) is not allowed.
On the other hand, there are control characters in the range 0x80-0x9F. Different sources disagree on whether they are allowed, and I have also seen that this differs between XHTML, HTML and XML.
Some articles say that FF is allowed as well?
Can someone provide a good answer, with sources, on what is allowed and what isn't?
EDIT: Even at http://www.w3.org/International/questions/qa-controls there is some ambiguity:
The C1 range is supported
But the table shows that they are illegal, while the UTF-8 validation shown previously allows them?
Upvotes: 0
Views: 3475
Reputation: 97805
The Unicode characters in these ranges are valid in HTML 4.01:
0x09..0x0A 0x0D 0x20..0x7E 0x00A0..0xD7FF 0xE000..0x10FFFF
In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258
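As a rough illustration, the HTML 4.01 ranges listed above can be turned into a simple per-character check. This is only a sketch: the names `HTML401_RANGES`, `is_html401_char` and `invalid_positions` are made up for this example, and it assumes the input has already been decoded to a Unicode string.

```python
# Code point ranges quoted above as valid in HTML 4.01 (inclusive bounds).
HTML401_RANGES = (
    (0x09, 0x0A),        # TAB, LF
    (0x0D, 0x0D),        # CR
    (0x20, 0x7E),        # printable ASCII
    (0xA0, 0xD7FF),      # skips DEL and 0x80..0x9F, stops before surrogates
    (0xE000, 0x10FFFF),  # everything after the surrogate block
)


def is_html401_char(ch):
    """Return True if the single character falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HTML401_RANGES)


def invalid_positions(text):
    """Return the indexes of characters outside the listed ranges."""
    return [i for i, ch in enumerate(text) if not is_html401_char(ch)]
```

With these ranges, form feed (0x0C), DEL (0x7F) and 0x85 are all reported as invalid, while TAB, LF and CR pass.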
Upvotes: 2
Reputation: 56572
I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, that is, the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 standard for <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":
This is the default content type. Forms submitted with this content type must be encoded as follows:
- Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
- The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
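To see what those rules produce in practice, here is a small sketch using Python's standard `urllib.parse`. The helper names `encode_form` and `decode_form` and the sample field values are invented for this example. Note that `urlencode` percent-encodes non-ASCII text as UTF-8 bytes by default, which goes beyond the ASCII-only wording of the quoted rule.

```python
from urllib.parse import parse_qsl, urlencode


def encode_form(fields):
    """Encode (name, value) pairs as application/x-www-form-urlencoded."""
    # Spaces become '+', other reserved characters become %HH,
    # and pairs are joined with '&' in the order given.
    return urlencode(fields)


def decode_form(body):
    """Decode a form-urlencoded body back into (name, value) pairs."""
    return parse_qsl(body)


fields = [("name", "J. Doe"), ("note", "line one\r\nline two & more")]
body = encode_form(fields)
# body == "name=J.+Doe&note=line+one%0D%0Aline+two+%26+more"
```

The line break in the second value is transmitted as `%0D%0A`, so a server-side validator sees CR and LF only after percent-decoding.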
As for which character encoding is used (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the Content-Type header of the response (well, really a GET or POST request).
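The %A0 case can be checked directly. The sketch below assumes you already know which charset the form was submitted in; `decode_value` and `decodes_as` are hypothetical helper names, and only `urllib.parse.unquote_to_bytes` comes from the standard library.

```python
from urllib.parse import unquote_to_bytes


def decode_value(raw, charset):
    """Percent-decode a form value and decode its bytes strictly."""
    # '+' means space in form data; convert it before percent-decoding
    # so that a literal '%2B' still ends up as '+'.
    data = unquote_to_bytes(raw.replace("+", " "))
    return data.decode(charset, errors="strict")


def decodes_as(raw, charset):
    """Return True if the value is well-formed in the given charset."""
    try:
        decode_value(raw, charset)
    except UnicodeDecodeError:
        return False
    return True
```

Under ISO-8859-1, `%A0` decodes to a non-breaking space (U+00A0). Under UTF-8 the lone byte 0xA0 is malformed, and the same character would have to arrive as `%C2%A0`.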
Upvotes: 7
Reputation: 5610
Postel's Law: Be conservative in what you do; be liberal in what you accept from others.
If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.
Upvotes: 6
Reputation: 40739
The first link you mention has nothing to do with validating the allowed characters in XHTML. The example on that page simply shows a common, generic pattern for detecting whether raw data is UTF-8 encoded.
This is a quote from the second link:
HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).
The way I read this is:
Any control character in the C1 range is supported, either encoded directly in the document's character encoding or represented as an NCR.
Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.
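For the XML side of that reading, here is a minimal sketch based on the `Char` production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The function names are invented for this example, and the check covers XML 1.0 only, not HTML 4.01.

```python
def is_xml10_char(ch):
    """Return True if the character matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_c1(ch):
    """Return True for the C1 control range U+0080..U+009F."""
    return 0x80 <= ord(ch) <= 0x9F


def escape_c1(text):
    """Replace C1 controls with numeric character references (NCRs)."""
    return "".join(f"&#x{ord(ch):X};" if is_c1(ch) else ch for ch in text)
```

Here `escape_c1("a\x85b")` gives `a&#x85;b`, while a C0 control such as U+0001 fails `is_xml10_char` and cannot be rescued with an NCR.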
Upvotes: 1
Reputation: 13622
Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?
If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.
Upvotes: 0
Reputation: 23655
What programming language are you using? For Java, at least, there are libraries that check the encoding of a string (or byte array). I'd guess similar libraries exist for other languages too.
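In Python, for example, no extra library is needed, since a strict decode already rejects malformed UTF-8. This is just a sketch, and `is_valid_utf8` is a name chosen for the example:

```python
def is_valid_utf8(data):
    """Return True if the bytes are well-formed UTF-8."""
    try:
        # Strict decoding rejects overlong forms, stray continuation
        # bytes and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Keep in mind that this only checks the encoding. Control characters such as 0x00 or 0x7F are still well-formed UTF-8, so deciding which characters to allow is a separate step.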
Upvotes: 0
Reputation: 655189
First of all, any octet is valid. The regular expression for UTF-8 sequences that you mention just omits some of them because users rarely enter them in practice. That doesn't mean they are invalid. They are just not expected to occur.
Upvotes: 1
Reputation: 161773
If the document is known to be XHTML, then you should just load it and validate it against the schema.
Upvotes: 0