Matthew Feickert

Reputation: 897

Different results with NumPy percentile and TensorFlow percentile for "nearest" interpolation method

I've noticed that even though NumPy's numpy.percentile and TensorFlow Probability's tfp.stats.percentile give the same docstring explanation for their "nearest" interpolation method

This optional parameter specifies the interpolation method to use when the desired percentile lies between two data points i < j:

...

‘nearest’: i or j, whichever is nearest.

they give different results. Below is a minimal working example of what I mean.

Environment

$ "$(which python3)" --version
Python 3.7.5
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
numpy~=1.18
tensorflow~=2.1
tensorflow-probability~=0.9
black
(question) $ python -m pip install -r requirements.txt

Code

# question.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp


def main():
    a = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    q = 50
    print(f"Flattened array: {a.flatten()}")
    print("NumPy:")
    print(f"\t{q}th percentile (linear): {np.percentile(a, q, interpolation='linear')}")
    print(
        f"\t{q}th percentile (nearest): {np.percentile(a, q, interpolation='nearest')}"
    )

    b = tf.convert_to_tensor(a)
    print("TensorFlow:")
    print(
        f"\t{q}th percentile (linear): {tfp.stats.percentile(b, q, interpolation='linear')}"
    )
    print(
        f"\t{q}th percentile (nearest): {tfp.stats.percentile(b, q, interpolation='nearest')}"
    )


if __name__ == '__main__':
    main()

which when run gives differing results for the "nearest" interpolation method

(question) $ python question.py
Flattened array: [10.  7.  4.  3.  2.  1.]
NumPy:
    50th percentile (linear): 3.5
    50th percentile (nearest): 3.0
TensorFlow:
    50th percentile (linear): 3.5
    50th percentile (nearest): 4.0
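
For reference, working the fractional index out by hand shows why 3.0 and 4.0 are the only two candidate answers for "nearest" here (a quick check of my own, using the same flattened array):

# check.py
import numpy as np

a = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
ap = np.sort(a.flatten())  # array([ 1.,  2.,  3.,  4.,  7., 10.])
q = 50
frac_index = q / 100 * (ap.size - 1)  # 2.5, exactly halfway between indices 2 and 3
print(ap[2], ap[3])  # 3.0 4.0 -- the two possible "nearest" results
print(ap[2] + (frac_index - 2) * (ap[3] - ap[2]))  # 3.5 -- the "linear" result both agree on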

After poking around the NumPy v1.18.2 source of the function that numpy.percentile calls, I'm still confused as to why. My guess is that it comes down to a rounding decision (NumPy uses numpy.around while TFP uses tf.round).

Can someone explain what is causing the difference? I'd like to write a shim for the two functions, but to do that I need to understand the return behavior.

Upvotes: 0

Views: 1711

Answers (1)

Matthew Feickert

Reputation: 897

Stepping through the source of both, it turns out this is not a rounding issue as I first thought, but rather that numpy.percentile performs the final evaluation on an ascending-sorted ndarray, while tfp.stats.percentile performs it on a descending-sorted tensor.
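
As a quick check that rounding really is not the culprit: both numpy.around and tf.round round half to even, so the fractional index of 2.5 becomes 2 in both libraries (a minimal check of my own):

import numpy as np
import tensorflow as tf

# Both use round-half-to-even, so the tie at 2.5 resolves to 2 either way.
print(np.around(2.5))  # 2.0
print(tf.round(tf.constant(2.5, dtype=tf.float64)))  # tf.Tensor(2.0, shape=(), dtype=float64)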

# answer.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability.python.internal import tensorshape_util
from tensorflow_probability.python.internal import distribution_util


def numpy_src(input, q, axis=0, out=None):
    a = input
    q = np.true_divide(q, 100)  # 0.5
    q = np.asanyarray(q)  # array(0.5)
    q = q[None]  # array([0.5])
    ap = a.flatten()  # array([10.,  7.,  4.,  3.,  2.,  1.])
    Nx = ap.shape[axis]  # 6
    indices = q * (Nx - 1)  # array([2.5])
    indices = np.around(indices).astype(np.intp)  # array([2])
    ap.partition(indices, axis=axis)  # array([ 1.,  2.,  3.,  4.,  7., 10.])
    indices = indices[0]  # 2
    r = np.take(ap, indices, axis=axis, out=out)  # 3.0
    print(f"Result of np.percentile source: {r}")


def tensorflow_src(input, q=50, axis=None):
    x = input
    name = "percentile"
    interpolation = "nearest"
    q = tf.cast(q, tf.float64)  # tf.Tensor(50.0, shape=(), dtype=float64)
    if axis is None:
        y = tf.reshape(
            x, [-1]
        )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64)
    frac_at_q_or_above = 1.0 - q / 100.0  # tf.Tensor(0.5, shape=(), dtype=float64)
    # _sort_tensor(y)
    # N.B. Here is the difference. Note the sort order is never changed
    sorted_y, _ = tf.math.top_k(
        y, k=tf.shape(y)[-1]
    )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64), _
    tensorshape_util.set_shape(
        sorted_y, y.shape
    )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64)
    d = tf.cast(tf.shape(y)[-1], tf.float64)  # tf.Tensor(6.0, shape=(), dtype=float64)
    # _get_indices(interpolation)
    indices = tf.round(
        (d - 1) * frac_at_q_or_above
    )  # tf.Tensor(2.0, shape=(), dtype=float64)
    indices = tf.clip_by_value(
        tf.cast(indices, tf.int32), 0, tf.shape(y)[-1] - 1
    )  # tf.Tensor(2, shape=(), dtype=int32)
    # N.B. The sort order here is descending, causing a difference
    gathered_y = tf.gather(
        sorted_y, indices, axis=-1
    )  # tf.Tensor(4.0, shape=(), dtype=float64)
    result = distribution_util.rotate_transpose(gathered_y, tf.rank(q))  # 4.0
    print(f"Result of tf.percentile source: {result}")


def main():
    np_in = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    numpy_src(np_in, q=50)
    tf_in = tf.convert_to_tensor(np_in)
    tensorflow_src(tf_in, q=50)


if __name__ == "__main__":
    main()

which when run gives

$ python answer.py 
Result of np.percentile source: 3.0
Result of tf.percentile source: 4.0
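
In other words, both implementations land on the same rounded index, but they count from opposite ends of the sorted data: index 2 of the descending sort is index 3 of the ascending sort. A small NumPy illustration of the index mapping (my own, using the same flattened values):

import numpy as np

ascending = np.array([1.0, 2.0, 3.0, 4.0, 7.0, 10.0])
descending = ascending[::-1]
index = 2  # the rounded index both implementations compute
print(ascending[index])  # 3.0 -- what numpy.percentile gathers
print(descending[index])  # 4.0 -- what tfp.stats.percentile gathers
print(ascending[ascending.size - 1 - index])  # 4.0 -- descending index 2 maps to ascending index 3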

If the following were instead added to TensorFlow Probability's percentile to make the sort order of the evaluation ascending

sorted_y = tf.reverse(
    sorted_y, [-1]
)  # tf.Tensor([ 1.  2.  3.  4.  7. 10.], shape=(6,), dtype=float64)

then the two results would be the same

$ python answer.py 
Result of np.percentile source: 3.0
Result of tf.percentile source: 3.0

Given that TensorFlow Probability's docstring says

Given a vector x, the q-th percentile of x is the value q / 100 of the way from the minimum to the maximum in a sorted copy of x.

this seems wrong, as the evaluation is effectively being done q / 100 of the way from the maximum to the minimum. I've opened TensorFlow Probability Issue 864 to discuss this.
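
As a stopgap until that is resolved, here is a rough sketch of a shim that reproduces NumPy's "nearest" behavior with plain TensorFlow ops by doing the gather on an ascending sort (this only handles the flattened, axis=None case and is not part of the tfp.stats API):

# shim.py
import numpy as np
import tensorflow as tf


def percentile_nearest(x, q):
    """Sketch of a NumPy-style 'nearest' percentile over the flattened input."""
    # Sort ascending, as numpy.percentile effectively does
    y = tf.sort(tf.reshape(tf.cast(x, tf.float64), [-1]), direction="ASCENDING")
    d = tf.cast(tf.shape(y)[-1], tf.float64)
    # Round the fractional index (round-half-to-even, matching numpy.around)
    index = tf.cast(tf.round((d - 1.0) * tf.cast(q, tf.float64) / 100.0), tf.int32)
    index = tf.clip_by_value(index, 0, tf.shape(y)[-1] - 1)
    return tf.gather(y, index)


a = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
print(percentile_nearest(tf.convert_to_tensor(a), 50))
# tf.Tensor(3.0, shape=(), dtype=float64) -- matches np.percentile(a, 50, interpolation='nearest')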

Upvotes: 1
