Hoang Minh Nguyen
Hoang Minh Nguyen

Reputation: 135

How to compute statistics (sum, mean, variance, etc.) over an entire dataset in Tensorflow

Computing mean, total, etc. of each feature in a dataset seems quite trivial in Pandas and Numpy, but I couldn't find any similarly easy functions/operations for tf.data.Dataset. Actually I found tf.data.Dataset.reduce which allows me to compute running sum, but it's not that easy for other operation (min, max, std, etc.)

So, my question is, is there a simple way to compute statistics for tf.data.Dataset? Moreover, is there a way to standardize/normalize (an entire, i.e. not in batch) tf.data.Dataset, especially if not using tf.data.Dataset.reduce?

Upvotes: 5

Views: 1605

Answers (1)

ruancomelli
ruancomelli

Reputation: 730

So, my question is, is there a simple way to compute statistics for tf.data.Dataset?

It depends on the statistics you wish to compute.

For instance, to compute the minimum or maximum, you can use:

import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.range(10, output_type=tf.float32) # sample dataset

minimum = ds.reduce(np.Inf, tf.math.minimum) # 0.0
maximum = ds.reduce(-np.Inf, tf.math.maximum) # 9.0

This is because tf.data.Dataset.reduce's requirements for the reduce function are directly met by minimum and maximum.

To compute the mean (and perhaps other statistics), one approach is to use Keras metrics. The code gets a bit messier, but it does the trick:

mean = tf.keras.metrics.Mean()
for batch in ds:
    mean.update_state(batch)

print(m.result().numpy()) # 7.0

To compute statistics other than the ones available in Keras, I guess you'll have to write your own reducer function. For instance, if you wish to implement a reducer for the standard deviation, you can calculate it based on a previous stddev and the new mean.

Moreover, is there a way to standardize/normalize (an entire, i.e. not in batch) tf.data.Dataset, especially if not using tf.data.Dataset.reduce?

No, this is not possible since the elements in a tf.data.Dataset are not necessarily known until when you generate them.

Upvotes: 4

Related Questions