Jayjay95

Reputation: 199

Problems with computing the entropy of random variables

I've been trying to figure out how to calculate the entropy of a random variable X using

sp.stats.entropy()

from SciPy's stats package. The random variable X is the series of returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the function takes the probability values

pk

as an argument, and so far I'm struggling even to compute the empirical probabilities, since all I have are the observations of the random variable. I've tried different ways of normalising the data to obtain an array of probabilities, but my data also contains negative values, so when I do

asset1/np.sum(asset1)

where asset1 is the row array of the returns of the stock of "Company 1", I do obtain a new array that sums to 1, but it contains some negative entries, and negative probabilities do not exist. Is there any way in Python to compute the empirical probabilities of my observations (ideally with the option of choosing specific bins, or a range of values)?
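To make the goal concrete, this is roughly the kind of workflow I was hoping for, binning the returns and normalising the counts (the 50-bin choice is arbitrary and asset1 stands in for my actual data), though I'm not sure whether this is a sound way of defining the probabilities:

    import numpy as np
    from scipy import stats

    asset1 = np.random.randn(4000) * 0.02             # placeholder for the actual returns of "Company 1"
    counts, bin_edges = np.histogram(asset1, bins=50)  # bin count chosen arbitrarily
    pk = counts.astype(float) / counts.sum()           # non-negative and sums to 1
    H = stats.entropy(pk)                              # Shannon entropy in nats (pass base=2 for bits)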

EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to stock prices, which are processes. Therefore, entropy can definitely be applied in this context.

Upvotes: 0

Views: 1100

Answers (1)

Paul Brodersen

Reputation: 13041

For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest-neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function, and estimate the entropy directly from the distances of the data points to their k-th nearest neighbours.

The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large, and hence the entropy is large. In practice, instead of the nearest-neighbour distance, one typically uses the distance to the k-th nearest neighbour, which makes the estimate more robust.
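As a rough sketch of this idea (not the code from the repository linked below; the function name kl_entropy, the default k=4, and the use of Euclidean distances are just illustrative choices), something along these lines can be put together with SciPy's k-d tree:

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma, gamma

    def kl_entropy(x, k=4):
        # Kozachenko-Leonenko k-nearest-neighbour entropy estimate (in nats) for
        # samples x of shape (n_samples, n_dims); duplicate points would need jitter.
        x = np.atleast_2d(x)
        if x.shape[0] == 1:               # treat a 1-D array of returns as n samples in 1 dimension
            x = x.T
        n, d = x.shape
        tree = cKDTree(x)
        dist, _ = tree.query(x, k=k + 1)  # k + 1 because each point is its own nearest neighbour
        eps = dist[:, -1]                 # distance to the k-th nearest neighbour
        log_ball_volume = (d / 2.0) * np.log(np.pi) - np.log(gamma(d / 2.0 + 1))
        return digamma(n) - digamma(k) + log_ball_volume + d * np.mean(np.log(eps))

For a 1-D array of returns, kl_entropy(asset1) would then give an entropy estimate without any binning.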

I have implementations for both on my github: https://github.com/paulbrodersen/entropy_estimators

The code has only been tested with Python 2.7, but I would be surprised if it didn't run on 3.x.

Upvotes: 1
