Reputation: 283
I have two data sets var1 and var2. I want to compare the two distributions.
library(ggplot2)
library(reshape)
set.seed(101)
var1 = rnorm(1000, 0.5)
var2 = rnorm(100000,0.5)
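# Stack the samples in long format; data.frame(var1, var2) would silently recycle the shorter var1 100 times
combine = rbind(data.frame(variable = "var1", value = var1),
                data.frame(variable = "var2", value = var2))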
ggplot(data = combine) + geom_density(aes(x = value, color = variable), alpha = 0.2)
This produces density plots that look quite similar (apart from a few wiggles near the median). However, I want to show that the tails of the distributions are not the same: the spread is larger for var2 than for var1. Other than spread and quantiles, which statistics could be compared to show the differences?
Does anyone know of any techniques, statistical tests, or visualizations that are especially suited to showing differences in the tails of the distributions (heavier tails, more extreme values)?
Upvotes: 2
Views: 1278
Reputation: 93821
You can show that the tails are in fact different by plotting quantiles of each sample (see below for an example).
In terms of testing whether the samples were drawn from a normal distribution, you could test for departures from normality with the Anderson-Darling test:
library(goftest)
ad.test(var1, "pnorm", mean=0.5)
ad.test(var2, "pnorm", mean=0.5)
You could also use the Anscombe-Glynn test to check whether each sample's kurtosis (which reflects how heavy the tails are) departs from the value expected under normality:
library(moments)
anscombe.test(var1)
anscombe.test(var2)
Neither of these tests suggests a statistically significant departure from normality, which makes sense, since both samples are relatively large and were in fact drawn from the same distribution.
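To see what the kurtosis test is looking at, the sample kurtosis can be computed by hand in base R. This is only a rough sketch: `sample_kurtosis` is a hypothetical helper name (not a function from `moments`), and it uses the plain moment ratio, for which a normal distribution gives a value close to 3.

```r
set.seed(101)
var1 <- rnorm(1000, 0.5)
var2 <- rnorm(100000, 0.5)

# Hypothetical helper: fourth central moment divided by the squared
# second central moment (about 3 for normally distributed data)
sample_kurtosis <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m4 <- mean((x - m)^4)
  m4 / m2^2
}

k1 <- sample_kurtosis(var1)
k2 <- sample_kurtosis(var2)
print(c(var1 = k1, var2 = k2))
```

Values well above 3 would point to heavier tails than a normal distribution; values near 3 for both samples are consistent with the test results above.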
You might find these links useful regarding testing for differences in the tails of a distribution: here and here.
In terms of visualizing the distributions, plotting quantiles might make it easier to discern differences in the tails:
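library(dplyr)    # group_by(), summarise(), %>%
library(tidyr)    # unnest()
library(ggpubr)   # ggarrange()
# Compute quantiles of each sample on a fine grid of probabilities
prob = seq(0,1,0.0001)
dat = combine %>% group_by(variable) %>%
  summarise(value = list(quantile(value, probs=prob)),
            Percentile = list(prob*100))
# Expand the list columns back into one row per percentile
p = dat %>% unnest(cols = c(value, Percentile)) %>%
  ggplot(aes(Percentile, value, colour=variable)) +
  geom_line() +
  theme_bw()
# Zoom in on the lowest and highest 10% of each distribution
ggarrange(p + scale_x_continuous(limits=c(0,10), breaks=0:100),
          p + scale_x_continuous(limits=c(90,100), breaks=0:100),
          ncol=2, common.legend=TRUE)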
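The same tail comparison can also be tabulated numerically. The following is a minimal base R sketch; the probability levels in `tail_probs` and the name `tail_table` are arbitrary choices for illustration, not anything prescribed by the packages above.

```r
set.seed(101)
var1 <- rnorm(1000, 0.5)
var2 <- rnorm(100000, 0.5)

# Illustrative probability levels covering the lower and upper tails
tail_probs <- c(0.001, 0.01, 0.05, 0.95, 0.99, 0.999)

# One row per probability level, with the matching quantile of each sample
tail_table <- data.frame(
  prob = tail_probs,
  var1 = quantile(var1, tail_probs, names = FALSE),
  var2 = quantile(var2, tail_probs, names = FALSE)
)
tail_table$diff <- tail_table$var2 - tail_table$var1
print(tail_table)
```

Keep in mind that with only 1000 observations in var1, the most extreme quantiles (0.1% and 99.9%) rest on very few points, so some difference there is expected from sampling noise alone.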
An empirical cumulative distribution function (ECDF) plot is another option:
ggplot(combine, aes(value, colour=variable)) +
stat_ecdf() +
theme_bw()
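A two-sample Q-Q plot is a closely related view that puts the quantiles of one sample directly against the other. The sketch below uses base R's `qqplot()` and base graphics rather than ggplot2, purely to keep it self-contained; treat it as one possible way to draw it.

```r
set.seed(101)
var1 <- rnorm(1000, 0.5)
var2 <- rnorm(100000, 0.5)

# qqplot() sorts both samples and interpolates the longer one down to the
# length of the shorter, giving 1000 matched quantile pairs here
qq <- qqplot(var1, var2, plot.it = FALSE)

# Points on the red identity line mean the two samples agree at that quantile
plot(qq$x, qq$y, xlab = "var1 quantiles", ylab = "var2 quantiles")
abline(0, 1, col = "red")
```

If the points bend away from the line at either end, the samples differ in that tail; if they stay close to the line throughout, the distributions are similar.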
Upvotes: 2