Reputation:
Please tell me how to calculate skewness and kurtosis along with their respective standard error and confidence interval associated with it(i.e. SE of Skewness and S.E of Kurtosis) I found two packages
1) package:'measure' can only calculate skewness and kurtosis
2) package:'rela' can calcuate both skewness and kurtosis but uses bootstrap by default and no command to turn it off during the calculation.
Upvotes: 3
Views: 8779
Reputation: 66
def skew_kurt(dataframe: pd.DataFrame) -> pd.DataFrame:
out = []
for col in dataframe:
x = dataframe[col]
sd = x.std()
if sd == 0:
out.append({name: np.nan for name in ['skew stat', 'skew se', 'kurt stat', 'kurt se']})
continue
w, m1 = len(x), x.mean()
dif = x - m1
m2, m3, m4 = tuple([(dif ** i).sum() for i in range(2, 5)])
skew = w * m3 / (w - 1) / (w - 2) / sd ** 3
skew_se = np.sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
kurt = (w * (w + 1) * m4 - 3 * m2 ** 2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * sd ** 4)
kurt_se = np.sqrt(4 * (w ** 2 - 1) * skew_se ** 2 / ((w - 3) * (w + 5)))
out.append({'skew stat': skew, 'skew se': skew_se, 'kurt stat': kurt, 'kurt se': kurt_se})
dataframe = pd.DataFrame(out, index = list(dataframe))
dataframe['skew<2'] = np.absolute(dataframe['skew stat']) < 2 * dataframe['skew se']
dataframe['kurt<2'] = np.absolute(dataframe['kurt stat']) < 2 * dataframe['kurt se']
return dataframe[['skew stat', 'skew se', 'skew<2', 'kurt stat', 'kurt se', 'kurt<2']]
Upvotes: 0
Reputation: 2158
@HBat is right: provided your sample data is Gaussian, you can compute the standard error using the equation from wikipedia
n = len(sample)
se_skew = ((6*n*(n-1))/((n-2)*(n+1)*(n+3)))**0.5
However, @BigBendRegion is also right: if your data is not Gaussian, this does not work. Then you may need to bootstrap.
R has the DescTools package that can bootstrap confidence intervals for skew (among other things). It can be included in python using rpy2
like so:
""" Import rpy2 and the relevant package"""
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
DescTools = importr('DescTools')
""" You will probably need this if you want to work with numpy arrays"""
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
def compute_skew(data, confidence_level=0.99):
""" Compute the skew and confidence interval using rpy2, DescTools
@param data
@return dict with keys: skew, skew_ci_lower, skew_ci_upper"""
d = {}
d["skew"], d["skew_ci_lower"], d["skew_ci_upper"] = DescTools.Skew(data, conf_level=confidence_level)
return d
""" Call the function on your data (assuming that is saved in a variable named sample)"""
print(compute_skew(sample))
Upvotes: 1
Reputation: 220
The standard errors are valid for normal distributions, but not for other distributions. To see why, you can run the following code (which uses the spssSkewKurtosis
function shown above) to estimate the true confidence level of the interval obtained by taking the kurtosis estimate plus or minus 1.96 standard errors:
set.seed(12345)
Nsim = 10000
Correct = numeric(Nsim)
b1.ols = numeric(Nsim)
b1.alt = numeric(Nsim)
for (i in 1:Nsim) {
Data = rnorm(1000)
Kurt = spssSkewKurtosis(Data)[2,1]
seKurt = spssSkewKurtosis(Data)[2,2]
LowerLimit = Kurt -1.96*seKurt
UpperLimit = Kurt +1.96*seKurt
Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit
}
TrueConfLevel = mean(Correct)
TrueConfLevel
This gives you 0.9496, acceptably close to the expected 95%, so the standard errors work as expected when the data come from a normal distribution. But if you change Data = rnorm(1000)
to Data = runif(1000)
, then you are assuming that the data come from a uniform distribution, whose theoretical (excess) kurtosis is -1.2. Making the corresponding change from Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit
to Correct[i] = LowerLimit <= -1.2 & -1.2 <= UpperLimit
gives the result 1.0, meaning that the 95% intervals were always correct, rather than correct for 95% of the samples. Hence, the standard error seems to be overestimated (too large) for the (light-tailed) uniform distribution.
If you change Data = rnorm(1000)
to Data = rexp(1000)
, then you are assuming that the data come from an exponential distribution, whose theoretical (excess) kurtosis is 6.0. Making the corresponding change from Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit
to Correct[i] = LowerLimit <= 6.0 & 6.0 <= UpperLimit
gives the result 0.1007, meaning that the 95% intervals were correct only for 10.07% of the samples, rather than correct for 95% of the samples. Hence, the standard error seems to be underestimated (too small) for the (heavy-tailed) exponential distribution.
Those standard errors are grossly incorrect for non-normal distributions, as the simulation above shows. Thus, the only use of those standard errors is to compare the estimated kurtosis with the expected theoretical normal value (0.0); e.g., using a test of hypothesis. They cannot be used to construct a confidence interval for the true kurtosis.
Upvotes: 4
Reputation: 5702
I'm simply copying and pasting the code published by Howard Seltman in here:
# Skewness and kurtosis and their standard errors as implement by SPSS
#
# Reference: pp 451-452 of
# http://support.spss.com/ProductsExt/SPSS/Documentation/Manuals/16.0/SPSS 16.0 Algorithms.pdf
#
# See also: Suggestion for Using Powerful and Informative Tests of Normality,
# Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino, Jr.,
# The American Statistician, Vol. 44, No. 4 (Nov., 1990), pp. 316-321
spssSkewKurtosis=function(x) {
w=length(x)
m1=mean(x)
m2=sum((x-m1)^2)
m3=sum((x-m1)^3)
m4=sum((x-m1)^4)
s1=sd(x)
skew=w*m3/(w-1)/(w-2)/s1^3
sdskew=sqrt( 6*w*(w-1) / ((w-2)*(w+1)*(w+3)) )
kurtosis=(w*(w+1)*m4 - 3*m2^2*(w-1)) / ((w-1)*(w-2)*(w-3)*s1^4)
sdkurtosis=sqrt( 4*(w^2-1) * sdskew^2 / ((w-3)*(w+5)) )
mat=matrix(c(skew,kurtosis, sdskew,sdkurtosis), 2,
dimnames=list(c("skew","kurtosis"), c("estimate","se")))
return(mat)
}
To get skewness and kurtosis of a variable along with their standard errors, simply run this function:
x <- rnorm(100)
spssSkewKurtosis(x)
## estimate se
## skew -0.684 0.241
## kurtosis 0.273 0.478
Upvotes: 5
Reputation: 1718
try package psych:
> a <- data.frame(cola=rep(c('A','B','C'),100),colb=sample(1:1000,300),colc=rnorm(300))
> describe(a)
vars n mean sd median trimmed mad min max range skew kurtosis se
cola* 1 300 2.00 0.82 2.00 2.00 1.48 1.00 3.00 2.00 0.00 -1.51 0.05
colb 2 300 511.76 285.59 506.50 514.21 362.50 1.00 999.00 998.00 -0.04 -1.17 16.49
colc 3 300 0.12 1.04 0.05 0.10 1.07 -2.54 2.91 5.45 0.12 -0.24 0.06
> describe(a)$skew
[1] 0.00000000 -0.04418551 0.11857609
Upvotes: 0