dll

Reputation: 161

Convert large array of integers to unicode string and then back to array of integers in node.js

I have some data which is represented as an array of integers and can be up to 200 000 elements. The integer value can vary from 0 to 200 000.

To emulate this data (for debugging purposes) I can do the following:

let data = [];
let len = 200000
for (let i = 0; i < len; i++) {
    data[i] = i;
}

To convert this array of integers to a Unicode string, I do this:

let dataAsText = data.map((e) => {
    return String.fromCodePoint(e);
}).join('');

When I convert the string back to an array of integers, the resulting array is longer than the original:

let dataBack = dataAsText.split('').map((e) => {
    return e.codePointAt(0);
});
console.log(dataBack.length);

Why does this happen? What is wrong?

Extra information:

EDIT:

I use codePoint because I need to convert the data to Unicode text so that the result is short.

More about codePoint vs. charCode, with an example: if we convert 150000 to a character and then back to an integer with codePoint:

console.log(String.fromCodePoint("150000").codePointAt(0)); 

this gives us 150000 which is correct. Doing the same with charCode fails and prints 18928 (and not 150000):

console.log(String.fromCharCode("150000").charCodeAt(0));
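
For reference, a minimal sketch of where 18928 comes from, assuming the usual behaviour that String.fromCharCode keeps only the low 16 bits of its argument (the variable names are just for illustration):

```javascript
const n = 150000;

// fromCharCode truncates its argument to 16 bits, i.e. n modulo 65536.
const viaCharCode = String.fromCharCode(n).charCodeAt(0);

// fromCodePoint keeps the full value and stores it as a surrogate pair.
const viaCodePoint = String.fromCodePoint(n).codePointAt(0);

console.log(viaCharCode, n % 0x10000); // both are 18928
console.log(viaCodePoint);             // 150000
```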

Upvotes: 1

Views: 1346

Answers (4)

georg

Reputation: 214959

If you're looking for a way to encode a list of integers so that you can safely transmit it over a network, Node.js Buffers with base64 encoding might be a better option:

let data = [];
for (let i = 0; i < 200000; i++) {
    data.push(i);
}

// encoding

var ta = new Uint32Array(data); // same element type as used for decoding
var buf = Buffer.from(ta.buffer);
var encoded = buf.toString('base64');

// decoding

var buf = Buffer.from(encoded, 'base64');
var ta = new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2);
var decoded = Array.from(ta);

// same?

console.log(decoded.join() == data.join())

Your original approach won't work reliably because not every integer in that range is a valid Unicode character: the values 0xD800 to 0xDFFF are surrogates, which only have meaning in pairs.

UPD: if the storage or transport is binary-safe (a file, for example), there is no need for base64; just store the buffer as is:

// saving

const fs = require('fs');
var ta = new Uint32Array(data);
fs.writeFileSync('whatever', Buffer.from(ta.buffer));

// loading

var buf = fs.readFileSync('whatever');
var loadedData = Array.from(new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2));

// same?

console.log(loadedData.join() == data.join())
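
The encode and decode steps above can be wrapped into two small helpers. This is only a sketch: the names encodeInts and decodeInts are made up for illustration, and it assumes every value fits in an unsigned 32-bit integer. The decoder copies the bytes into a fresh ArrayBuffer first, so it does not depend on the Buffer's byteOffset being a multiple of 4.

```javascript
// Hypothetical helpers; Buffer is a Node.js global.
function encodeInts(ints) {
    const ta = Uint32Array.from(ints);
    // View the typed array's bytes as a Buffer and encode them as base64.
    return Buffer.from(ta.buffer, ta.byteOffset, ta.byteLength).toString("base64");
}

function decodeInts(b64) {
    const buf = Buffer.from(b64, "base64");
    // Copy into a new, aligned ArrayBuffer before viewing it as 32-bit integers.
    const copy = buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
    return Array.from(new Uint32Array(copy));
}

const sample = [0, 1, 55296, 199999];
const sampleEncoded = encodeInts(sample);
const sampleDecoded = decodeInts(sampleEncoded);
console.log(sampleDecoded.join() === sample.join());
```

Typed arrays use the byte order of the machine they run on, so the encoded string is only portable between machines with the same endianness.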

Upvotes: 1

trincot

Reputation: 350300

That's because code points above 0xFFFF are stored as two 16-bit code units (a surrogate pair), as can be seen in this snippet:

var s = String.fromCodePoint(0x2F804)
console.log(s);  // Shows one character
console.log('length = ', s.length); // 2, because encoding is \uD87E\uDC04

var i = s.codePointAt(0);
console.log('CodePoint value at 0: ', i); // correct

var i = s.codePointAt(1); // Should not do this, it starts in the middle of a sequence!
console.log('CodePoint value at 1: ', i); // misleading

In your code, things go wrong at split(''): it splits the string into individual 16-bit code units, ignoring the fact that some pairs of units are meant to combine into a single character.

You can use the ES6 solution to this, where the spread syntax takes this into account:

let dataBack = [...dataAsText].map(e => e.codePointAt(0));

Now your counts will be the same.

Example:

// (Only 20 instead of 200000)
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.map(e => String.fromCodePoint(e)).join("");

console.log("String length: " + dataAsText.length);

let dataBack = [...dataAsText].map(e => e.codePointAt(0));

console.log(dataBack);

Surrogates

Be aware that within the range 0 ... 65535 there is a sub-range (0xD800 to 0xDFFF) reserved for so-called surrogates, which only represent a character when combined with another value. You should not iterate over those values expecting each one to represent a character on its own. So in your original code, this will be another source of error.

To fix this, you should really skip over those values:

for (let i = 0; i < len; i++) {
    if (i < 0xd800 || i > 0xdfff) data.push(i);
}

In fact, there are many other code points that do not represent a character.
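
A small sketch of how surrogate values break the round trip, and of how the filter above avoids it. The variable names are only for illustration; the first case uses two surrogate values that happen to sit next to each other in the data.

```javascript
// Two lone surrogates that end up adjacent in the string.
const pair = [0xDBFF, 0xDC00];
const pairText = pair.map(n => String.fromCodePoint(n)).join("");
const pairBack = Array.from(pairText, c => c.codePointAt(0));
console.log(pairBack.length); // 1: the two values merged into one code point

// Skipping the surrogate range, as in the loop above, keeps the counts equal.
const safe = [];
for (let i = 0xD7F0; i < 0xE010; i++) {
    if (i < 0xD800 || i > 0xDFFF) safe.push(i);
}
const safeText = safe.map(n => String.fromCodePoint(n)).join("");
const safeBack = Array.from(safeText, c => c.codePointAt(0));
console.log(safeBack.length === safe.length);
```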

Upvotes: 3

T.J. Crowder

Reputation: 1074445

I don't think you want codePointAt (or charCodeAt) at all. To convert a number to a string, just use String; to have a single delimited string with all the values, use a delimiter (like ,); to convert it back to a number, use the appropriate one of Number, the unary +, parseInt, or parseFloat (in your case, probably Number or +):

// Only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.join(",");

console.log(dataAsText);

let dataBack = dataAsText.split(",").map(Number);

console.log(dataBack);

If your goal with codePointAt is to keep the dataAsText string short, then you can do that, but you can't use split to recreate the array because JavaScript strings are UTF-16 (effectively) and split("") will split at each 16-bit code unit rather than keeping code points together.

A delimiter would help there too:

// Again, only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.map(e => String.fromCodePoint(e)).join(",");

console.log("String length: " + dataAsText.length);

let dataBack = dataAsText.split(",").map(e => e.codePointAt(0));

console.log(dataBack);
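
One caveat with the delimiter variant, sketched below with made-up sample values: if the data can contain 44 (the code point of ","), that value collides with the delimiter. Iterating the undelimited string by code point, here with Array.from, avoids the collision; surrogate values would still need to be excluded from the data.

```javascript
// 44 is the code point of ",", the delimiter itself.
const values = [43, 44, 45, 199999];

// Delimited version: the "," character is mistaken for a separator.
const delimited = values.map(n => String.fromCodePoint(n)).join(",");
const viaSplit = delimited.split(",").map(s => s.codePointAt(0));
console.log(viaSplit.length); // 5 instead of 4, with undefined entries

// Undelimited version, decoded by iterating over code points.
const plain = values.map(n => String.fromCodePoint(n)).join("");
const viaIteration = Array.from(plain, c => c.codePointAt(0));
console.log(viaIteration); // [ 43, 44, 45, 199999 ]
```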

Upvotes: 1

Joseph Young

Reputation: 2795

I suspect split('') doesn't handle Unicode values above 0xFFFF correctly: a quick test with values above 65536 shows that each one becomes two elements after splitting.
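
A quick test along those lines might look like this (a sketch; the variable name is arbitrary):

```javascript
const ch = String.fromCodePoint(65536);

console.log(ch.length);           // 2: stored as two 16-bit code units
console.log(ch.split("").length); // 2: split("") cuts between the code units
console.log([...ch].length);      // 1: the string iterator keeps the pair together
```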

Perhaps look at this post and answers, as they ask a similar question

Upvotes: 1
