I have a matrix (a 2-D NumPy array) in which each row is a valid probability distribution. I also have a vector (a 1-D NumPy array) that is itself a probability distribution. I need to compute the KL divergence between each row of the matrix and the vector. Is it possible to do this without using for loops?
An existing question asks the same thing, but none of its answers solve my problem. One of them suggests using a for loop, which I want to avoid because my data is large. Another answer provides a solution in TensorFlow, but I need one for NumPy arrays.
scipy.stats.entropy computes the KL divergence between two vectors, but I couldn't figure out how to use it when one of them is a matrix.
Upvotes: 0
Views: 2525
The function scipy.stats.entropy can, in fact, do the vectorized calculation, but you have to reshape the arguments appropriately for it to work. When the inputs are two-dimensional arrays, entropy expects the columns to hold the probability vectors. In the case where p is two-dimensional and q is one-dimensional, a trivial dimension must be added to q to make the arguments compatible for broadcasting.
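To see why the layout matters, here is a minimal NumPy-only sketch of the same idea. It is not SciPy's implementation; it only mimics the "columns hold the distributions, reduce along axis 0" convention described above. The small arrays and the names pT and q_col are made up for illustration, and the sketch assumes every entry is strictly positive and every distribution already sums to 1.

```python
import numpy as np

# Hypothetical data: each row of p is a distribution, q is a single distribution.
p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3]])      # shape (2, 3)
q = np.array([0.2, 0.5, 0.3])        # shape (3,)

pT = p.T                   # shape (3, 2): now the columns are the distributions
q_col = q[:, np.newaxis]   # shape (3, 1): broadcasts against every column of pT

# KL divergence of each column of pT from q: sum of p * log(p / q) down axis 0.
kl = np.sum(pT * np.log(pT / q_col), axis=0)
print(kl.shape)            # (2,)
```

Without the extra axis, q would have shape (3,), which does not broadcast against the (3, 2) array.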
Here's an example. First, the imports:
In [10]: import numpy as np
In [11]: from scipy.stats import entropy
Create a two-dimensional p whose rows are the probability vectors, and a one-dimensional probability vector q:
In [12]: np.random.seed(8675309)
In [13]: p = np.random.rand(3, 5)
In [14]: p /= p.sum(axis=1, keepdims=True)
In [15]: q = np.random.rand(5)
In [16]: q /= q.sum()
In [17]: p
Out[17]:
array([[0.32085531, 0.29660176, 0.14113073, 0.07988999, 0.1615222 ],
       [0.05870513, 0.15367858, 0.29585406, 0.01298657, 0.47877566],
       [0.1914319 , 0.29324935, 0.1093297 , 0.17710131, 0.22888774]])
In [18]: q
Out[18]: array([0.06804561, 0.35392387, 0.29008139, 0.04580467, 0.24214446])
For comparison with the vectorized result, here's the result computed using a Python loop.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
To make entropy do the vectorized calculation, the columns of the first argument must be the probability vectors, so we'll transpose p. Then, to make q compatible with p.T, we'll reshape it into a two-dimensional array with shape (5, 1) (i.e. it contains a single column):
In [20]: entropy(p.T, q.reshape(-1, 1))
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
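As a possible alternative to transposing, SciPy releases that provide the axis parameter of entropy (added in SciPy 1.4.0, as far as I know) can reduce along the rows directly. The sketch below uses small made-up arrays, and it calls np.broadcast_to to give q the same shape as p explicitly, so it does not rely on whether a particular SciPy version broadcasts the two arguments itself. The names q_rows, kl_axis and kl_loop are illustrative only.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical data: the rows of p and the vector q each sum to 1.
p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3]])
q = np.array([0.2, 0.5, 0.3])

# View q as a (2, 3) array in which every row is q, matching p's shape.
q_rows = np.broadcast_to(q, p.shape)

# Reduce along axis 1, i.e. across each row (assumes a SciPy with `axis`).
kl_axis = entropy(p, q_rows, axis=1)

# Row-by-row reference computed with a Python loop.
kl_loop = np.array([entropy(row, q) for row in p])
print(np.allclose(kl_axis, kl_loop))
```

np.broadcast_to returns a read-only view, so no copy of q is made for each row.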
Note: It is tempting to use q.T as the second argument, but that won't work. In NumPy, the transpose operation only permutes existing dimensions; it never creates new ones. So the transpose of a one-dimensional array is the array itself. That is, q.T has the same shape as q.
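A quick shape check makes the point concrete. This is only a sketch with a made-up four-element vector:

```python
import numpy as np

q = np.array([0.1, 0.2, 0.3, 0.4])

# Transposing a 1-D array leaves its shape unchanged.
print(q.T.shape)                # (4,)

# Either of these adds a trailing axis, producing a single column.
print(q.reshape(-1, 1).shape)   # (4, 1)
print(q[:, np.newaxis].shape)   # (4, 1)
```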
Older version of this answer follows...
You can use scipy.special.kl_div or scipy.special.rel_entr to do this. Here's an example.
In [17]: import numpy as np
    ...: from scipy.stats import entropy
    ...: from scipy.special import kl_div, rel_entr
Make p and q for the example. p has shape (3, 5); the rows are the probability distributions. q is a 1-d array with length 5.
In [18]: np.random.seed(8675309)
    ...: p = np.random.rand(3, 5)
    ...: p /= p.sum(axis=1, keepdims=True)
    ...: q = np.random.rand(5)
    ...: q /= q.sum()
This is the calculation that you want, using a Python loop and scipy.stats.entropy. I include this here so the result can be compared to the vectorized calculation below.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
We have constructed p and q so that the probability vectors each sum to 1. In this case, the above result can also be computed in a vectorized calculation with scipy.special.rel_entr or scipy.special.kl_div. (I recommend rel_entr. kl_div adds and subtracts additional terms that will ultimately cancel out in the sum, so it does a bit more work than necessary.)
These functions compute only the point-wise part of the calculations; you have to sum the result to get the actual entropy or divergence.
In [20]: rel_entr(p, q).sum(axis=1)
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
In [21]: kl_div(p, q).sum(axis=1)
Out[21]: array([0.32253909, 0.17897139, 0.26279053])
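To illustrate the remark about the extra terms, the following sketch compares the two functions element by element. It relies on the documented definitions for positive arguments, rel_entr(x, y) = x*log(x/y) and kl_div(x, y) = x*log(x/y) - x + y; the small arrays and the name pointwise_gap are made up for illustration.

```python
import numpy as np
from scipy.special import kl_div, rel_entr

# Hypothetical data: the rows of p and the vector q each sum to 1.
p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3]])
q = np.array([0.2, 0.5, 0.3])

# Element-wise, kl_div differs from rel_entr by the extra terms (-x + y).
pointwise_gap = kl_div(p, q) - rel_entr(p, q)
print(np.allclose(pointwise_gap, q - p))

# Summed over a row, the extra terms give sum(q) - sum(p) = 0, so the
# two row sums agree when every distribution is normalized.
kl_from_rel_entr = rel_entr(p, q).sum(axis=1)
kl_from_kl_div = kl_div(p, q).sum(axis=1)
print(np.allclose(kl_from_rel_entr, kl_from_kl_div))
```

If the inputs are not normalized, the two sums generally differ, and neither matches scipy.stats.entropy, which normalizes its arguments before computing the divergence.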
Upvotes: 1