Reputation: 35772
Ok, so here is a problem analogous to my real one (I'll elaborate on the real problem below, but I think the analogy is easier to understand).
I have a strange two-sided coin that only comes up heads (randomly) 1 in every 1,001 tosses (the remainder being tails). In other words, for every 1,000 tails I see, there will be 1 heads.
I have a peculiar disease where I only notice 1 in every 1,000 tails I see, but I notice every heads, so it appears to me that heads make up 0.5 of the tosses I notice. Of course, I'm aware of this disease and its effect, so I can compensate for it.
Someone now gives me a new coin, and I notice that heads now make up 0.6 of the tosses I notice. Given that my disease hasn't changed (I still only notice 1 in every 1,000 tails), how do I calculate the actual rate of heads (as a fraction of all tosses) that this new coin produces?
Ok, so what is the real problem? Well, I have a bunch of data consisting of inputs, and outputs that are 1s and 0s. I want to teach a supervised machine learning algorithm to predict the expected output (a float between 0 and 1) given an input. The problem is that the 1s are very rare, and this screws up the internal math because it becomes very susceptible to rounding errors, even with high-precision floating point math.
So, I normalize the data by randomly omitting most of the 0 training samples so that there appears to be a roughly equal ratio of 1s and 0s. Of course, this means that the machine learning algorithm's output is no longer predicting a probability, i.e. instead of predicting 0.001 as it should, it would now predict 0.5.
I need a way to convert the output of the machine learning algorithm back to a probability within the original training set.
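For concreteness, the downsampling step described above might look something like the following. This is only a minimal sketch: the function name `downsample_zeros`, the `keep_rate` parameter, and the synthetic data are assumptions made up for illustration, not part of my actual pipeline.

```python
import random

def downsample_zeros(samples, keep_rate, seed=0):
    """Keep every sample labelled 1 and a random fraction of those labelled 0.

    samples: list of (input, label) pairs, label being 0 or 1.
    keep_rate: probability of keeping each 0-labelled sample (assumed name).
    """
    rng = random.Random(seed)
    return [(x, y) for x, y in samples if y == 1 or rng.random() < keep_rate]

# Hypothetical data: 1 positive per 1,000 negatives, as in the coin analogy.
samples = [(i, 1 if i % 1001 == 0 else 0) for i in range(100_100)]
kept = downsample_zeros(samples, keep_rate=0.001)

ones = sum(y for _, y in kept)
zeros = len(kept) - ones
print(ones, zeros)  # roughly balanced; the exact count of 0s depends on the seed
```

The `keep_rate` used here is exactly the number needed later to map the model's output back to a probability on the original data.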
Author's Note (2015-10-07): I later discovered that this technique is commonly known as "downsampling".
Upvotes: 0
Views: 238
Reputation: 59655
You are calculating the following:
calculatedRatio = heads / (heads + tails / 1000)
and you need
realRatio = heads / (heads + tails)
Setting heads = 1 (only the ratio of heads to tails matters) and solving both equations for tails yields the following equations.
tails = 1000 / calculatedRatio - 1000
tails = 1 / realRatio - 1
Combining both yields the following.
1000 / calculatedRatio - 1000 = 1 / realRatio - 1
And finally solving for realRatio.
realRatio = 1 / (1000 / calculatedRatio - 999)
This checks out: a calculatedRatio of 0.5 yields a realRatio of 1/1001, and 0.6 yields 3/2003.
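The same correction can be sketched in Python, generalized so that the fraction of tails you notice is a parameter instead of the hard-coded 1/1000. The names `real_ratio` and `keep_rate` are made up for this illustration; `Fraction` is used only so the two checks above come out exact.

```python
from fractions import Fraction

def real_ratio(calculated_ratio, keep_rate=Fraction(1, 1000)):
    """Convert the observed fraction of heads back to the true fraction.

    calculated_ratio: fraction of heads among the *noticed* tosses.
    keep_rate: fraction of tails that is noticed (1/1000 in the question).
    """
    # Noticed tails per head is 1/c - 1; dividing by keep_rate
    # recovers the real number of tails per head.
    tails_per_head = (1 / calculated_ratio - 1) / keep_rate
    return 1 / (1 + tails_per_head)

print(real_ratio(Fraction(1, 2)))  # 1/1001
print(real_ratio(Fraction(3, 5)))  # 3/2003
```

With keep_rate = 1/1000 this reduces to realRatio = 1 / (1000 / calculatedRatio - 999). Plain floats work as well, e.g. real_ratio(0.6, 0.001), at the cost of ordinary rounding.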
Upvotes: 2