Rhubbarb

Reputation: 4418

How do I get an ASCII code from a string in JavaScript?

(Similar questions to this have been asked on StackOverflow, but not exactly this. The nearest is probably "javascript how to convert unicode string to ascii", where there is already the remark "this has to be a dup[licate]". I have read some similar posts, but they don't answer my specific question. I've looked on the very good W3Schools site, and have also Googled it, but not found the answer that way either. So any hints here would be very much appreciated.)


I have an array of bytes being passed to a piece of JavaScript. In the JavaScript the data arrives in a string. I do not know the mechanism of transfer, as it's from a 3rd-party application. I do not even know whether the string is "wide" or "narrow".

In my JavaScript, I have some code like b = str.charCodeAt(pos);.

My problem is that a byte value such as 0x86 = 134 is coming through as character 0x2020 = 8224. This seems to be because my original byte is being interpreted as a 'dagger' character and then translated to the equivalent Unicode code point. (The mapping 0x86 → '†' is actually Windows-1252 rather than strict Latin-1, which assigns 0x80..0x9F to control codes; the problem may or may not be JavaScript's 'fault'.) Similar problems occur with other values: the ranges 0x00..0x7F and 0xA0..0xFF seem to be fine, but most values from 0x80..0x9F are affected, and in each case the resulting value seems to be the Unicode code point for the character the original byte represents in Windows-1252.

Another observation is that the length of the string is what I'd expect for a narrow string if the length were measured in bytes. (On the other hand, if length returns a count of abstract characters, this doesn't tell me anything.)

So, in JavaScript, is there a way of getting at the 'raw' bytes in a string, of getting a Latin-1 or ASCII character code directly, of converting between character encodings, or of defining the default encoding?

I could write my own mapping, but I'd rather not. I expect that is what I'll end up doing, but that has the feel of a kludge on a kludge.
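
For reference, the sort of mapping I'm imagining would look something like this. It's a sketch only, assuming the substitutions follow the Windows-1252 code chart (which would explain 0x86 arriving as U+2020); the function name is just for illustration:

// Map the Unicode code points that Windows-1252 assigns to bytes
// 0x80..0x9F back to their original byte values. (The bytes 0x81,
// 0x8D, 0x8F, 0x90 and 0x9D are undefined in Windows-1252.)
var cp1252FromUnicode = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84,
  0x2026: 0x85, 0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88,
  0x2030: 0x89, 0x0160: 0x8A, 0x2039: 0x8B, 0x0152: 0x8C,
  0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92, 0x201C: 0x93,
  0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B,
  0x0153: 0x9C, 0x017E: 0x9E, 0x0178: 0x9F
};

// Hypothetical replacement for str.charCodeAt(pos): undo the
// byte -> Windows-1252 -> Unicode translation where it applies.
function byteAt(str, pos) {
  var code = str.charCodeAt(pos);
  if (code < 0x100) return code;               // already a plain byte value
  var mapped = cp1252FromUnicode[code];
  return mapped !== undefined ? mapped : code; // unknown: pass through
}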

I'm also looking into whether there's anything I can adjust in the calling application (as it could be passing the data as a wide string, although I doubt it).

Either way, though, I'd be interested in whether there is a simple JavaScript solution, or to understand why there isn't.

(If the incoming data was character data, having Unicode dealt with so automatically would be great. But it's not, it's just a binary data stream.)

Thanks.

Upvotes: 3

Views: 4449

Answers (2)

Nicholas Carey

Reputation: 74385

Start with the JavaScript (ECMAScript) spec: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf. It says:

8.4 The String Type

The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.

NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
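
In other words, length and charCodeAt count 16-bit code units, not abstract characters. For instance, a character outside the Basic Multilingual Plane occupies two elements (values here are straight from the Unicode charts):

// U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair:
var clef = "\uD834\uDD1E";
clef.length === 2;             // one character, two 16-bit elements
clef.charCodeAt(0) === 0xD834; // the lead surrogate, not a code point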

What charCodeAt(p) gives you is the UTF-16 value (a 16-bit number) of the character at index p in the string. Since UTF-16 directly represents Unicode's Basic Multilingual Plane (that would be code points U+0000..U+D7FF and U+E000..U+FFFF), your Latin-1 characters should be the values you expect them to be.
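
A quick console check illustrates the point, and shows why the reported value is a red flag (character values here are standard Unicode):

"\u00E9".charCodeAt(0) === 0xE9;   // é: BMP code point equals the Latin-1 byte
"\u2020".charCodeAt(0) === 0x2020; // †: the value being seen for byte 0x86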

The fact that they are not suggests to me that you have an encoding problem with the inbound 3rd-party octet stream: if the conversion to UTF-16 gets the encoding of that stream wrong, you'll get odd results.

Perhaps it is being treated as vanilla ASCII, when in fact it is UTF-8 (or vice versa). UTF-8 represents code points above 0x7F as 2-, 3- or 4-octet sequences.
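
As a sketch of that kind of mix-up in one direction (the octet values below are the standard UTF-8 encoding of U+00E9):

// The octets 0xC3 0xA9 encode 'é' (U+00E9) in UTF-8. Decoded as
// Latin-1 instead, they become two separate code units:
var misdecoded = "\u00C3\u00A9";   // "Ã©"
misdecoded.length === 2;           // two code units, not one
misdecoded.charCodeAt(0) === 0xC3; // not the 0xE9 that was meant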

Upvotes: 3

Mike Samuel

Reputation: 120586

There is no such thing as the raw bytes in a String. The EcmaScript spec defines a string as a sequence of UTF-16 code units. That is the most fine-grained representation exposed by any interpreter I have ever encountered.

In the browser there are no built-in encoding libraries. You have to roll your own if you are trying to represent a byte array as a string and want to re-encode it.

If your string already happens to be valid ASCII, then you can get the numeric value of a code unit by using the charCodeAt method.

"\n".charCodeAt(0) === 10

Upvotes: 6
