Oliver Bernhardt

Reputation: 419

Standard Normalization considering Skewness and Kurtosis

I have a rather fundamental statistics question. I know Stack Overflow might not be the perfect place for it, but as a software developer I don't know of any good statistics forums, and Stack Overflow has served me very well in the past.

My problem is the following. I need to standardize some data. I have two different data sets, and after normalization they should share roughly the same distribution. Until now I have used the standard score for that: (x - mu) / sigma. After transforming all values of my two distributions like this, I want the resulting distributions of the transformed values to be pretty much identical.

This worked well so far, but now I have run into the problem that one of my two distributions is skewed. Standard normalization does not account for that, so after normalization the mean and the standard deviation may be the same, but one distribution is skewed while the other is symmetric.

My question now: is there a known way of doing a standard normalization that also takes skewness and kurtosis into account in the transformation? One important thing to mention is that my values can also be negative.
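To illustrate the problem, here is a minimal sketch in Python (standard library only). The sample data and the helper names `skewness` and `standard_score` are made up for illustration and are not my real data or code. It shows that the standard score fixes the mean and standard deviation but leaves the skewness untouched.

```python
import random
import statistics

def skewness(xs):
    """Moment-based skewness: the mean of the cubed z-scores (population form)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return sum(((x - mu) / sigma) ** 3 for x in xs) / len(xs)

def standard_score(xs):
    """Apply (x - mu) / sigma to every value."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical stand-ins for the two data sets.
symmetric = [rng.gauss(5.0, 2.0) for _ in range(5000)]
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(5000)]  # shifted, so some values are negative

z_sym = standard_score(symmetric)
z_skw = standard_score(skewed)

# Both now have mean 0 and standard deviation 1, but the shape is unchanged:
print(skewness(symmetric), skewness(z_sym))
print(skewness(skewed), skewness(z_skw))
```

The second line prints the same clearly positive skewness before and after the transformation, which is exactly the mismatch described above.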

I can see that this might not be the right forum so I would also be very happy if someone can point me to a credible statistics forum.

Oli

Upvotes: 0

Views: 2662

Answers (2)

Severin Pappadeux

Reputation: 20080

I'm not sure such a transformation exists in a generic, distribution-independent way (which one could call "standard"). Standard normalization is a linear transformation, (x - mu) / sigma, so that your distribution now matches N(0,1), a Gaussian with a mean of 0 and a sigma of 1, in its first two moments.

But skew can be measured as Skew = 3 * (Mean - Median) / Standard Deviation (Pearson's second skewness coefficient). With a mean of 0 and a standard deviation of 1, what is left is -3 * Median. So if you still have non-zero skew by this measure, it means a non-zero median, which you want to make 0.
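A small Python sketch of that arithmetic, using only the standard library. The data values and the function name `pearson_median_skewness` are invented for illustration.

```python
import statistics

def pearson_median_skewness(xs):
    """3 * (mean - median) / standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return 3.0 * (mu - statistics.median(xs)) / sigma

# Made-up sample with a long right tail; values may be negative.
data = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 9.0]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)
z = [(x - mu) / sigma for x in data]  # standard scores: mean 0, sigma 1

skew_before = pearson_median_skewness(data)
skew_after = -3.0 * statistics.median(z)  # what is left after standardizing

print(skew_before, skew_after)  # the two values agree
```

The linear transformation moves the median along with everything else, so this measure of skew is the same before and after.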

For that, the only option left is a non-linear transformation, which I believe would be distribution-dependent. Basically, pjs made a similar statement: conversion via quantiles assumes working with the CDF and inverse CDF, which is way beyond linear transformations and cannot be standardized without dealing with the properties of the distribution.

Maybe a simple model for skewed distributions, such as the skew normal, could yield a simple model for such a transformation.

Upvotes: 1

pjs

Reputation: 19853

If your goal is to see if the two data sets share the same distribution, no need to do normalization. You should consider using a Q-Q plot. If the data share a common distribution, even with different parameterizations, the result will fall fairly close to a straight line.

Generating the Q-Q plot is easy when you have the same amount of data in the two sets. Sort both sets, then pair them up and plot them. If the sets are different sizes, you'll have to interpolate quantiles so that both sets are evaluated at the same probabilities (typically the larger set is interpolated at the positions given by the smaller one), which is more challenging.
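A rough sketch of building the Q-Q pairs in Python (standard library only). The helper names `quantile` and `qq_pairs` are hypothetical, and linear interpolation at evenly spaced probabilities is just one reasonable convention, not the only one. When the sets are the same size this reduces to plain sort-and-pair.

```python
def quantile(sorted_xs, p):
    """Linearly interpolated quantile of pre-sorted data at probability p in [0, 1]."""
    pos = p * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1.0 - frac) + sorted_xs[hi] * frac

def qq_pairs(a, b):
    """Return (quantile of a, quantile of b) pairs at common probabilities.

    The number of pairs equals the size of the smaller set, so the larger
    set is the one that effectively gets interpolated.
    """
    a, b = sorted(a), sorted(b)
    n = min(len(a), len(b))
    if n < 2:
        raise ValueError("need at least two points in each data set")
    probs = [i / (n - 1) for i in range(n)]
    return [(quantile(a, p), quantile(b, p)) for p in probs]

# Equal sizes: plain sort-and-pair.
print(qq_pairs([3, 1, 2], [30, 10, 20]))
# Unequal sizes: the five-point set is read off at three probabilities.
print(qq_pairs([1, 2, 3], [0, 1, 2, 3, 4]))
```

Plot the pairs as x/y points with whatever plotting tool you like; points near a straight line suggest a common distribution family.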

In your current case though, if one of the sets is skewed (based on more than just one or two outliers) and the other is symmetric, they're probably from different distributions.

If your data are normally distributed then "standardizing" yields a standard normal when the true variance is used for the transformation, and a t-distribution when the sample variance is used. However, since standardizing is a linear transformation it is shape-preserving. If your data are not normal, the standard transformation will not magically make them bell-shaped and symmetric.

The only transformation I'm aware of that reliably yields the same reference distribution is conversion to quantiles. It's a well-known result that if a random variable X has an invertible CDF F_X, then F_X(X) ~ U(0,1), i.e., mapping values of X through their own CDF yields quantiles normalized to the range (0,1). To apply this as a transformation, you have to know the correct CDF. That's where Q-Q plots are quite clever: if two data sets have the same underlying distribution, their quantiles will line up with each other regardless of whether you know the actual distribution or not.
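If the true CDF is unknown, one common workaround is to substitute the empirical CDF (the ranks of the data). That is an assumption on my part and not something your question guarantees is appropriate: it forces any data set into the reference shape, so it erases real differences between the two sets. A minimal Python sketch, with the hypothetical helper name `normal_scores`, that maps ranks to U(0,1) and then through the standard normal inverse CDF:

```python
import random
from statistics import NormalDist

def normal_scores(xs):
    """Map values through their empirical CDF, then through the N(0,1) inverse CDF.

    Assumption: the empirical CDF stands in for the unknown true CDF.
    Tied values receive distinct, arbitrarily ordered ranks in this sketch.
    """
    n = len(xs)
    std_normal = NormalDist()  # mean 0, sigma 1
    order = sorted(range(n), key=lambda i: xs[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n  # empirical quantile, strictly inside (0, 1)
        out[i] = std_normal.inv_cdf(u)
    return out

rng = random.Random(1)  # fixed seed so the sketch is reproducible
# Made-up skewed sample, shifted so that negative values occur.
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(1000)]
scores = normal_scores(skewed)

print(min(scores), max(scores))  # symmetric about 0 by construction
```

The transformation is monotone, so the ordering of the values is preserved and negative inputs are not a problem, but it is non-linear and depends entirely on the data it was fitted to.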

Bottom line: if you want to know whether your two data sets have the same distribution, use Q-Q plotting. If you want a transformation that will yield a known reference distribution for any (continuous) input distribution, you'll need to know the actual CDF involved.

Upvotes: 3
