Reputation: 135
Computing the mean, total, etc. of each feature in a dataset seems quite trivial in Pandas and Numpy, but I couldn't find any similarly easy functions/operations for tf.data.Dataset. I did find tf.data.Dataset.reduce, which lets me compute a running sum, but it's not that easy for other operations (min, max, std, etc.).
So, my question is: is there a simple way to compute statistics for a tf.data.Dataset? Moreover, is there a way to standardize/normalize an entire tf.data.Dataset (i.e. not per batch), especially without using tf.data.Dataset.reduce?
Upvotes: 5
Views: 1605
Reputation: 730
So, my question is, is there a simple way to compute statistics for tf.data.Dataset?
It depends on the statistics you wish to compute.
For instance, to compute the minimum or maximum, you can use:
import numpy as np
import tensorflow as tf
ds = tf.data.Dataset.range(10, output_type=tf.float32) # sample dataset
minimum = ds.reduce(np.inf, tf.math.minimum) # 0.0
maximum = ds.reduce(-np.inf, tf.math.maximum) # 9.0
This works because tf.math.minimum and tf.math.maximum already have the form that tf.data.Dataset.reduce requires of its reduce function: they take the current state and the next element and return the new state.
To compute the mean (and perhaps other statistics), one approach is to use Keras metrics. The code gets a bit messier, but it does the trick:
mean = tf.keras.metrics.Mean()
for batch in ds:
    mean.update_state(batch)  # accumulate each element into the running mean
print(mean.result().numpy()) # 4.5
To compute statistics other than the ones available in Keras, I guess you'll have to write your own reducer function. For instance, a reducer for the standard deviation can carry the running count, sum, and sum of squares as its state, and derive the standard deviation from those once the pass over the dataset is complete.
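As a rough sketch of what such a reducer could look like, the snippet below keeps a running count, sum, and sum of squares as the state, and only turns them into a mean and (population) standard deviation at the end. The names update and finalize are made up for illustration, and the example folds over a plain Python list with functools.reduce so that it runs without TensorFlow.

```python
import math
from functools import reduce


def update(state, x):
    """Fold one element into the running (count, sum, sum of squares)."""
    count, total, total_sq = state
    return count + 1.0, total + x, total_sq + x * x


def finalize(state):
    """Turn the accumulated state into (mean, population standard deviation)."""
    count, total, total_sq = state
    mean = total / count
    # Clamp at zero in case rounding error makes the variance slightly negative.
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Same values as the sample dataset above (0.0 through 9.0).
values = [float(v) for v in range(10)]
state = reduce(update, values, (0.0, 0.0, 0.0))
mean, std = finalize(state)
print(mean, std)
```

Since update has the (state, element) -> new state shape, it should in principle also be usable as the reduce function for tf.data.Dataset.reduce, with a tuple of three floats as the initial state and tf.sqrt in the final step, but I haven't shown that variant here, so treat it as an assumption. Note also that the sum-of-squares formula can lose precision in float32 on large datasets; an incremental scheme such as Welford's algorithm is the usual alternative.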
Moreover, is there a way to standardize/normalize (an entire, i.e. not in batch) tf.data.Dataset, especially if not using tf.data.Dataset.reduce?
No, this is not possible, since the elements in a tf.data.Dataset are not necessarily known until they are generated.
Upvotes: 4