Reputation: 161
I have some data represented as an array of integers, which can contain up to 200 000 elements. Each integer value can vary from 0 to 200 000.
To emulate this data (for debugging purposes) I can do the following:
let data = [];
let len = 200000;
for (let i = 0; i < len; i++) {
    data[i] = i;
}
To convert this array of integers to a Unicode string I do this:
let dataAsText = data.map((e) => {
    return String.fromCodePoint(e);
}).join('');
When I want to convert back to an array of integers, the array appears to be longer:
let dataBack = dataAsText.split('').map((e) => {
    return e.codePointAt(0);
});
console.log(dataBack.length);
Why does this happen? What is wrong?
Extra information:
I use codePointAt/fromCodePoint because they can deal with all Unicode values (up to 21 bits), while charCodeAt/fromCharCode fail.
Using, for example, .join('123') and .split('123') makes dataBack the same length as data, but this isn't an elegant solution because the string dataAsText becomes unnecessarily large.
If len is less than or equal to 65536 (which is 2^16), then everything works fine, which seems strange.
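For example, checking string lengths around that boundary:
let below = String.fromCodePoint(65535);
let above = String.fromCodePoint(65536);
console.log(below.length); // 1
console.log(above.length); // 2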
EDIT:
I use code points because I need to convert the data to Unicode text so that it stays short.
More about codePointAt vs charCodeAt with an example: if we convert 150000 to a character and then back to an integer with codePointAt:
console.log(String.fromCodePoint(150000).codePointAt(0));
this gives us 150000, which is correct. Doing the same with charCodeAt fails and prints 18928 (not 150000):
console.log(String.fromCharCode(150000).charCodeAt(0));
Upvotes: 1
Views: 1346
Reputation: 214959
If you're looking for a way to encode a list of integers so that you can safely transmit it over a network, Node.js Buffers with base64 encoding might be a better option:
let data = [];
for (let i = 0; i < 200000; i++) {
    data.push(i);
}
// encoding
var ta = new Int32Array(data);
var buf = Buffer.from(ta.buffer);
var encoded = buf.toString('base64');
// decoding
var buf = Buffer.from(encoded, 'base64');
var ta = new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2);
var decoded = Array.from(ta);
// same?
console.log(decoded.join() == data.join());
Your original approach won't work because not every integer has a corresponding code point in Unicode.
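To illustrate (a small sketch): values above 0x10FFFF are rejected outright, since they lie beyond the Unicode range:
try {
    String.fromCodePoint(0x110000); // not a valid code point
} catch (e) {
    console.log(e instanceof RangeError); // true
}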
UPD: if your storage or transport is binary-safe, there's no need for base64; just store the buffer as is:
const fs = require('fs');
// saving
var ta = new Int32Array(data);
fs.writeFileSync('whatever', Buffer.from(ta.buffer));
// loading
var buf = fs.readFileSync('whatever');
var loadedData = Array.from(new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2));
// same?
console.log(loadedData.join() == data.join());
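As a rough size comparison (simple arithmetic, assuming 200 000 elements at 4 bytes each), the raw buffer is 800 000 bytes, and base64 inflates that by about a third:
var rawBytes = 200000 * 4;                     // 800000
var base64Chars = Math.ceil(rawBytes / 3) * 4; // 1066668
console.log(rawBytes, base64Chars);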
Upvotes: 1
Reputation: 350300
That's because higher code point values yield two UTF-16 code units (a surrogate pair), as can be seen in this snippet:
var s = String.fromCodePoint(0x2F804);
console.log(s); // Shows one character
console.log('length = ', s.length); // 2, because encoding is \uD87E\uDC04
var i = s.codePointAt(0);
console.log('CodePoint value at 0: ', i); // correct
var i = s.codePointAt(1); // Should not do this, it starts in the middle of a sequence!
console.log('CodePoint value at 1: ', i); // misleading
In your code things go wrong when you do split(''), as that splits the string into its individual UTF-16 code units, discarding the fact that some pairs of units are intended to combine into a single character.
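A quick way to see this, using the 150000 example from the question:
var s = String.fromCodePoint(150000);
console.log(s.length);           // 2 (a surrogate pair)
console.log(s.split('').length); // 2: split('') breaks the pair apart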
You can use the ES6 spread syntax instead, which takes this into account:
let dataBack = [...dataAsText].map((e, i) => {
    // etc.
Now your counts will be the same.
Example:
// (Only 20 instead of 200000)
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}
let dataAsText = data.map(e => String.fromCodePoint(e)).join("");
console.log("String length: " + dataAsText.length);
let dataBack = [...dataAsText].map(e => e.codePointAt(0));
console.log(dataBack);
Be aware that within the range 0 ... 65535 there is a range (0xD800 ... 0xDFFF) reserved for so-called surrogates, which only represent a character when combined with another value in a pair. You should not iterate over those expecting each value to represent a character on its own. So in your original code, this is another source of error.
To fix this, you should really skip over those values:
for (let i = 0; i < len; i++) {
    if (i < 0xd800 || i > 0xdfff) data.push(i);
}
In fact, there are many other code points that do not represent a character.
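For example, U+FFFE and U+FFFF are designated noncharacters: they still round-trip numerically, but they are not intended for interchange as text:
console.log(String.fromCodePoint(0xFFFE).codePointAt(0)); // 65534, round-trips anyway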
Upvotes: 3
Reputation: 1074445
I don't think you want codePointAt (or charCodeAt) at all. To convert a number to a string, just use String; to get a single delimited string with all the values, use a delimiter (like ,); to convert it back to a number, use the appropriate one of Number, the unary +, parseInt, or parseFloat (in your case, probably Number or +):
// Only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}
let dataAsText = data.join(",");
console.log(dataAsText);
let dataBack = dataAsText.split(",").map(Number);
console.log(dataBack);
If your goal with codePointAt is to keep the dataAsText string short, then you can do that, but you can't use split('') to recreate the array, because JavaScript strings are UTF-16 (effectively) and split('') will split at each 16-bit code unit rather than keeping code points together. A delimiter would help there too:
// Again, only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}
let dataAsText = data.map(e => String.fromCodePoint(e)).join(",");
console.log("String length: " + dataAsText.length);
let dataBack = dataAsText.split(",").map(e => e.codePointAt(0));
console.log(dataBack);
Upvotes: 1
Reputation: 2795
I have a feeling split('') doesn't work with Unicode values above 0xFFFF; a quick test shows that such values become two entries after splitting.
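For instance, a minimal check:
console.log(String.fromCodePoint(70000).split('').length); // 2, not 1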
Perhaps look at this post and its answers, as they address a similar question.
Upvotes: 1