Reputation: 41
I'm trying to understand the skewness and kurtosis of a numeric variable, to get a sense of the shape of the data.
I first calculate the skewness like this:
skewness(data$responsetime)
[1] 26.56731
And the kurtosis:
kurtosis(data$responsetime)
[1] 3723.961
The skewness is positive, so the tail should extend to the right, and the kurtosis is greater than 3.
Now I would like to confirm both the skewness and the kurtosis with a plot. I try that like this:
plot(density(data$responsetime))
And I'm getting a plot like the one below, from which it is difficult to draw any conclusion. I'm new to R and I'm trying to make this graph clearer, for example by adjusting the x-axis range, but I can't find the command to do that. Do you know how to do that?
Using a histogram, like this:
hist(data$responsetime, breaks=100)
I also get a graph that is difficult to understand:
With plot(data$responsetime, xlim=c(0, 20000)) I get this:
With: plot(density(data$responsetime), xlim=c(0, 20000))
I get the graph below. But I don't understand: the x axis shows the response time, and the maximum response time according to max(data$responsetime) is 320000, so why does the tail stop around 18000?
Upvotes: 0
Views: 14404
Reputation: 220
Use qqnorm along with qqline - that shows both skewness and kurtosis very clearly.
code:
qqnorm(data$responsetime)
qqline(data$responsetime)
Right skew typically exhibits a convex appearance, left skew typically concave. With excess kurtosis <0, typically the tails are closer to the horizontal mid-line than the qqline predicts; with excess kurtosis >0, typically one or both of the tails is more extreme (farther away from the horizontal mid-line) than the qqline predicts.
You should see a convex appearance in the qq-plot of your data, with the right tail far above the qqline. This indicates that your distribution produces outliers in the right tail greatly in excess of what the normal distribution predicts.
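To see what a right-skewed variable looks like in this plot, here is a minimal sketch on simulated data. The log-normal sample `x` is only a stand-in for data$responsetime (an assumption, since the real data are not available). With right skew the points should bend upward, with the right tail well above the line.

```r
set.seed(1)

# Simulated right-skewed sample; a hypothetical stand-in for data$responsetime
x <- rlnorm(5000, meanlog = 5, sdlog = 1.5)

# qqnorm() plots sample quantiles (y) against theoretical normal quantiles (x)
# and invisibly returns both sets of coordinates
qq <- qqnorm(x, main = "Normal Q-Q plot, simulated right-skewed data")

# qqline() draws the line through the first and third quartiles
qqline(x, col = "red")
```

If the real variable behaves like this sample, both ends of the point cloud sit above the red line, and the right end by a very large margin.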
Kurtosis measures outliers, not the peak of the distribution. That might be a source of confusion for some people when it comes to relating the kurtosis statistic to the histogram.
The logic to understand why kurtosis measures outliers (not peak) is simple: Large |Z|-values indicate outliers. Kurtosis is the average of the Z^4 values. So |Z|-values close to zero (where the peak is) contribute virtually nothing to the kurtosis statistic, and thus the kurtosis statistic is non-informative about the peak. You can have a high kurtosis when the peak is pointy and you can have a high kurtosis when the peak is flat. It all depends on the disposition of the outliers.
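The argument above can be checked numerically. The sketch below uses a made-up sample (995 ordinary values plus five extreme ones, purely an assumption for illustration), computes kurtosis as the average of Z^4, and compares how much of that sum comes from points near the peak versus from outliers. It computes the non-excess version; some packages report excess kurtosis (this value minus 3), and the question does not say which package its kurtosis() comes from.

```r
set.seed(1)

# Hypothetical sample: mostly ordinary values plus a few extreme ones
x <- c(rnorm(995), 20, 25, 30, 35, 40)

# Z-scores, using the standard deviation with denominator n
z  <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
z4 <- z^4

kurt         <- mean(z4)                        # non-excess kurtosis; subtract 3 for excess
share_center <- sum(z4[abs(z) < 1]) / sum(z4)   # contribution of points near the peak
share_tails  <- sum(z4[abs(z) > 3]) / sum(z4)   # contribution of the outliers

c(kurtosis = kurt, center = share_center, tails = share_tails)
```

In this sample almost all of the Z^4 total comes from the five extreme points, while the hundreds of values near the center contribute next to nothing.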
Upvotes: 3
Reputation: 493
Relating to the hist() function:
hist(data$responsetime, breaks='FD')
I have found that breaks='FD' usually returns enough break points in the histogram to solve this issue. Also, from the graph it looks like you do have a very long tail.
Side note: if your data are that skewed, you may want to consider transforming them before working with them.
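As a rough illustration of the transformation idea, here is a sketch that compares histograms before and after a log transform. The simulated `responsetime` vector is a stand-in for data$responsetime, and the choice of log1p() is an assumption; another transform may suit the real data better.

```r
set.seed(1)

# Simulated stand-in for data$responsetime (an assumption; use the real column instead)
responsetime <- rlnorm(5000, meanlog = 5, sdlog = 1.5)

# log1p(x) is log(1 + x), which stays defined even if a response time is 0
log_rt <- log1p(responsetime)

# Draw both histograms side by side with Freedman-Diaconis breaks
op <- par(mfrow = c(1, 2))
hist(responsetime, breaks = "FD", main = "Raw", xlab = "response time")
hist(log_rt, breaks = "FD", main = "log1p-transformed", xlab = "log(1 + response time)")
par(op)
```

On the raw scale the bars pile up at the left edge, while on the log scale the shape of the bulk of the data becomes visible.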
Upvotes: 0