Charset in data URI

Question

Over the years, from reading the evolving specs, I had assumed that RFC 3986 had finally settled on UTF-8 as the encoding for percent-escaped octet sequences. That is, if my URI contains %XX%YY%ZZ, I can take the decoded octets (anywhere in the scheme-specific part of any URI) and interpret them as UTF-8 to recover the intended characters. In practical terms, I can call JavaScript's decodeURIComponent(), which does this decoding for me.

Then I read the spec for data: URIs, RFC 2397, which includes a charset parameter that (naturally) indicates the charset of the encoded data. But how does that work? If I have a two-octet encoded sequence %XX%YY in my data: URI, does charset=iso-8859-1 indicate that the two decoded octets should be interpreted not as a UTF-8 sequence but as two separate Latin characters (since each byte in ISO-8859-1 represents one character)? RFC 2397 seems to indicate this, as it gives an example of "greek [sic] characters":

data:text/plain;charset=iso-8859-7,%be%fg%be

But this means that JavaScript's decodeURIComponent() (which assumes UTF-8 encoded octets) can't be used to extract a string from a data URI, correct? Does this mean I have to write my own decoding for data URIs if the charset is something other than UTF-8?
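To make the concern concrete, here is a small sketch of how decodeURIComponent() behaves when the escaped octets are not valid UTF-8. The helper name tryDecode is made up for illustration; only decodeURIComponent() and URIError are standard.

```javascript
// Hypothetical helper: returns the decoded string, or the name of the
// error thrown when the escaped octets are not well-formed UTF-8.
function tryDecode(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    return err.name;
  }
}

// Two octets that form a valid UTF-8 sequence decode to one character.
console.log(tryDecode("%C3%A9")); // "é"

// A lone 0xBE octet (as in the RFC 2397 example) is not valid UTF-8.
console.log(tryDecode("%BE")); // "URIError"
```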

Furthermore, does this mean that RFC 2397 is now in conflict with RFC 3986, which seems to indicate that UTF-8 is assumed? Or does RFC 3986 only refer to "new URI scheme[s]", meaning that the data: URI scheme gets grandfathered in and has its own technique for specifying what the encoded octets mean?

My best guess at the moment is that data: plays by its own rules and if it indicates a charset other than UTF-8, I'll have to use something other than decodeURIComponent() in JavaScript. Any recommendations on a replacement method would be welcome, too.
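One possible replacement, sketched under assumptions rather than offered as the definitive method: percent-decode the payload to raw bytes first, then hand those bytes to TextDecoder with the label taken from the charset parameter. The function names percentDecodeToBytes and decodeDataUriText are hypothetical, the sketch ignores ;base64 payloads, and it assumes the runtime's TextDecoder supports the requested label. Note that TextDecoder follows the WHATWG Encoding Standard, where the labels iso-8859-1 and us-ascii both map to windows-1252.

```javascript
// Hypothetical helper: turn a percent-encoded string into raw bytes
// without assuming any character encoding.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    const hex = encoded.slice(i + 1, i + 3);
    if (encoded[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      // Unescaped URI characters are expected to be ASCII; keep the low byte.
      // A malformed escape such as "%fg" falls through here and stays literal.
      bytes.push(encoded.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Hypothetical helper: decode the text of a non-base64 data: URI
// according to its charset parameter.
function decodeDataUriText(dataUri) {
  const comma = dataUri.indexOf(",");
  if (!dataUri.startsWith("data:") || comma === -1) {
    throw new TypeError("not a data: URI");
  }
  const params = dataUri.slice(5, comma).split(";").map((p) => p.trim());
  if (params.some((p) => p.toLowerCase() === "base64")) {
    throw new TypeError("base64 payloads are not handled in this sketch");
  }
  const charsetParam = params.find((p) => p.toLowerCase().startsWith("charset="));
  // RFC 2397 defaults to text/plain;charset=US-ASCII when nothing is given.
  const charset = charsetParam ? charsetParam.slice("charset=".length) : "US-ASCII";
  return new TextDecoder(charset).decode(percentDecodeToBytes(dataUri.slice(comma + 1)));
}

console.log(decodeDataUriText("data:text/plain;charset=utf-8,%C3%A9"));      // "é"
console.log(decodeDataUriText("data:text/plain;charset=iso-8859-1,%C3%A9")); // "Ã©"
```

The same two octets yield one character under UTF-8 and two under ISO-8859-1, which is exactly the distinction the charset parameter is meant to express. TextDecoder throws a RangeError for labels it does not recognize, so a caller would likely want to catch that and fall back or report the problem.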
