What is the best way to handle large data with Tensorflow.js and tf.Tensor?

Question

I am using tf.Tensor and tf.concat() to handle large training data, and I found that calling tf.concat() repeatedly gets slow. What is the best way to load large data from a file into a tf.Tensor?

Background

I think it is common to handle data with arrays in JavaScript. To do that, the rough steps are as follows.

Steps to load data from a file into an Array

read a line from the file
parse the line into a JavaScript object
add that object to the array with Array.push()
after reading the last line, we can use that array with a for loop.

So I thought I could use tf.concat() in a similar way.

Steps to load data from a file into a tf.Tensor

read a line from the file
parse the line into a JavaScript object
convert the object to a tf.Tensor
append that tensor to the original tensor with tf.concat()
after reading the last line, we can use that tf.Tensor

Some code

Here is some code to measure the speed of both Array.push() and tf.concat().

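import * as tf from "@tensorflow/tfjs";

let t = tf.tensor1d([1]);
const addT = tf.tensor1d([2]);

// Time tf.concat(): print the elapsed time for every 1000 iterations
console.time();
for (let idx = 0; idx < 50000; idx++) {
    if (idx % 1000 === 0) {
        console.timeEnd();
        console.time();
        console.log(idx);
    }
    // Each call builds a new tensor one element longer than the previous one
    t = tf.tidy(() => t.concat(addT));
}
console.timeEnd(); // stop the last timer before the label is reused below

// Time Array.push() in the same way
const arr = [];
const addA = 1;
console.time();
for (let idx = 0; idx < 50000; idx++) {
    if (idx % 1000 === 0) {
        console.timeEnd();
        console.time();
        console.log(idx);
    }
    arr.push(addA);
}
console.timeEnd();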

Measurement

We can see that Array.push() stays stable, but tf.concat() gets slower.

For tf.concat()

default: 0.150ms
0
default: 68.725ms
1000
default: 62.922ms
2000
default: 23.199ms
3000
default: 21.093ms
4000
default: 27.808ms
5000
default: 39.689ms
6000
default: 34.798ms
7000
default: 45.502ms
8000
default: 94.526ms
9000
default: 51.996ms
10000
default: 76.529ms
11000
default: 83.662ms
12000
default: 45.730ms
13000
default: 89.119ms
14000
default: 49.171ms
15000
default: 48.555ms
16000
default: 55.686ms
17000
default: 54.857ms
18000
default: 54.801ms
19000
default: 55.312ms
20000
default: 65.760ms

For Array.push()

default: 0.009ms
0
default: 0.388ms
1000
default: 0.340ms
2000
default: 0.333ms
3000
default: 0.317ms
4000
default: 0.330ms
5000
default: 0.289ms
6000
default: 0.299ms
7000
default: 0.291ms
8000
default: 0.320ms
9000
default: 0.284ms
10000
default: 0.343ms
11000
default: 0.327ms
12000
default: 0.317ms
13000
default: 0.329ms
14000
default: 0.307ms
15000
default: 0.218ms
16000
default: 0.193ms
17000
default: 0.234ms
18000
default: 1.943ms
19000
default: 0.164ms
20000
default: 0.148ms

edkeveked · Accepted Answer

Though there is no single way of creating a tensor, the answer to the question depends on what is done with the tensors once they are created.

Performance

Tensors are immutable; therefore, each time tf.concat is called, a new tensor is created.

let x = tf.tensor1d([2]);
console.log(tf.memory()); // "numTensors": 1
const y = tf.tensor1d([3]);
x = tf.concat([x, y]);
console.log(tf.memory()); // "numTensors": 3 (the old x, y, and the concatenated result)

As the snippet above shows, there are 3 tensors after tf.concat is called, not 2. It is true that tf.tidy will dispose of unused tensors. But creating and disposing of tensors becomes more and more costly as the tensor grows. This is an issue of both memory consumption and computation, since creating a new tensor always delegates to a backend.
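To see why the cost keeps growing, a rough model helps. The sketch below makes no TensorFlow.js calls; it only counts element writes, assuming that every concat writes all elements of its output tensor. That is a simplification (a real backend adds allocation and scheduling overhead on top), and the function names are made up for this illustration.

```javascript
// Rough cost model: count how many elements get written in total.
// Assumption: every concat allocates a new tensor and writes all of its elements.

// Growing a tensor one element at a time, as in the question's loop.
function writesWithRepeatedConcat(appends) {
  let length = 1; // the initial one-element tensor
  let writes = 0;
  for (let i = 0; i < appends; i++) {
    length += 1;      // the result is one element longer
    writes += length; // and every element of the result is written again
  }
  return writes;
}

// Collecting values in a plain array first, then creating one tensor.
function writesWithSingleTensor(appends) {
  return appends + 1; // each element is written into the tensor once
}

const n = 50000; // same loop count as the benchmark in the question
console.log(writesWithRepeatedConcat(n)); // grows with n squared
console.log(writesWithSingleTensor(n));   // grows linearly with n
```

Under this model the repeated-concat loop does quadratic work in the number of appended elements, while building the tensor once does linear work.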

Creating a tensor from large data

Now that the performance issue is understood, what is the best way to proceed?

Create the whole array in JavaScript and, once it is complete, create the tensor.

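const x = [];
for (let i = 0; i < data.length; i++) {
  // fill array x (`data` stands for the parsed records)
  x.push(data[i]);
}
// create the tensor once, after the array is complete
const tensor = tf.tensor(x);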
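If the number of rows and columns is known before reading, one variation is to fill a preallocated Float32Array instead of a nested array and wrap it in a tensor once. This is only a sketch under assumptions that are not in the original: each line holds comma-separated numbers with a fixed column count, and `parseLinesToFloat32` is a hypothetical helper.

```javascript
// Hypothetical helper: parse fixed-width numeric CSV lines into one flat buffer.
function parseLinesToFloat32(lines, numCols) {
  const values = new Float32Array(lines.length * numCols);
  lines.forEach((line, row) => {
    const fields = line.split(",");
    if (fields.length !== numCols) {
      throw new Error(`line ${row} has ${fields.length} fields, expected ${numCols}`);
    }
    for (let col = 0; col < numCols; col++) {
      // Row-major layout: row 0 first, then row 1, and so on.
      values[row * numCols + col] = Number(fields[col]);
    }
  });
  return values;
}

const lines = ["1,2,3", "4,5,6"];
const flat = parseLinesToFloat32(lines, 3);
console.log(flat.length); // 6
```

The flat buffer can then be handed to tf.tensor2d(flat, [lines.length, 3]) in a single call, so only one tensor is ever created.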

Although this is the simplest solution, it is not always possible, because the array keeps all the data in memory and we can easily run out of memory with large inputs. Therefore, instead of creating the whole JavaScript array, it is sometimes better to create chunks of arrays, create a tensor from each chunk, and start processing those tensors as soon as they are created. The chunk tensors can be merged with tf.concat if necessary, but this is not always required.

For instance, we can call model.fit() repeatedly with chunk tensors instead of calling it once with one big tensor that might take a long time to create. In this case, there is no need to concatenate the chunk tensors.
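The chunked approach could look roughly like the sketch below. `chunked` and `fitOnChunks` are hypothetical helpers, and the sample shape `{x: number[], y: number[]}` is an assumption made for the example. `tf` and `model` are passed in as parameters so the sketch does not assume how TensorFlow.js is loaded.

```javascript
// Group items from any iterable into arrays of at most `size` items.
function* chunked(items, size) {
  if (!(size > 0)) throw new RangeError("size must be positive");
  let chunk = [];
  for (const item of items) {
    chunk.push(item);
    if (chunk.length === size) {
      yield chunk;
      chunk = [];
    }
  }
  if (chunk.length > 0) yield chunk; // last, possibly shorter, chunk
}

// Fit the model on one chunk at a time, without concatenating the chunks.
// Assumption: each sample looks like {x: number[], y: number[]}.
async function fitOnChunks(tf, model, samples, chunkSize) {
  for (const chunk of chunked(samples, chunkSize)) {
    const xs = tf.tensor2d(chunk.map((s) => s.x));
    const ys = tf.tensor2d(chunk.map((s) => s.y));
    try {
      await model.fit(xs, ys, {epochs: 1});
    } finally {
      // Free the chunk tensors before the next chunk is created.
      xs.dispose();
      ys.dispose();
    }
  }
}

console.log([...chunked([1, 2, 3, 4, 5], 2)]); // three chunks: [1, 2], [3, 4], [5]
```

Because `chunked` accepts any iterable, `samples` can be a generator that parses lines lazily, so the full data never has to sit in one array.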

If possible, create a dataset using tf.data. This is the ideal solution if the next step is to fit a model with the data.

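function makeIterator() {
  let index = 0; // position of the next element to emit
  const iterator = {
    next: () => {
      if (index < data.length) {
        // `data` stands for the source array being iterated over
        const result = {value: data[index], done: false};
        index++;
        return result;
      }
      // no value is emitted once the data is exhausted
      return {value: undefined, done: true};
    }
  };
  return iterator;
}
const ds = tf.data.generator(makeIterator);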

The advantage of using tf.data is that the dataset is created batch by batch, as needed, during the model.fitDataset call.
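tf.data.generator also accepts a generator function, which is often shorter than a hand-written iterator. The sketch below assumes each row is an array of numbers whose last entry is the label. That layout and the name `makeSampleGenerator` are illustrative choices, not part of the original answer.

```javascript
// Returns a generator function that yields one {xs, ys} sample per row.
// Assumption: the last entry of each row is the label, the rest are features.
function makeSampleGenerator(rows) {
  return function* () {
    for (const row of rows) {
      yield {xs: row.slice(0, -1), ys: row.slice(-1)};
    }
  };
}

const generator = makeSampleGenerator([[1, 2, 3], [4, 5, 6]]);
for (const sample of generator()) {
  console.log(sample);
}
```

Under those assumptions, tf.data.generator(generator).batch(32) should give a dataset that model.fitDataset can consume. The batch size of 32 is arbitrary. Each call to `generator()` starts a fresh pass over the rows, which matters when training for more than one epoch.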
