Reputation: 651
Okay, so I'm working with the classical iris dataset built into R.
I am using the variable Sepal.Width within iris subsetted to the species "setosa". I plotted the probability histogram and it looks like it could follow a normal distribution (bell shaped curve), so I collected the mean of Sepal.Width and also the standard deviation of Sepal.Width, and that's all I need to plot the curve.
Normal/Gaussian Distribution formula:
f(x) = 1 / [σ * √(2 * π)] * e^[-(x - μ)^2 / (2 * σ^2)],
where e = Euler's number, π = pi, sigma (σ) = standard deviation, mu (μ) = mean.
So here is my code:
> hist(iris$Sepal.Width[iris$Species == "setosa"], probability = TRUE,
main = "Histogram of Sepal Width for Setosa Species Overlayed with Normal Distribution",
xlab = "Sepal Width (cm)", ylab = "Probability", cex.main = 0.9)
> sepal_width_setosa <- iris$Sepal.Width[iris$Species == "setosa"]
> sepal_width_setosa_mean <- mean(sepal_width_setosa)
> sepal_width_setosa_mean
[1] 3.428
> sepal_width_setosa_sd <- sd(sepal_width_setosa)
> sepal_width_setosa_sd
[1] 0.3790644
> sepal_width_setosa_variance = var(sepal_width_setosa)
> range(sepal_width_setosa)
[1] 2.3 4.4
> sepal_width_setosa_gaussian_distribution <- function(x){
1 / (sepal_width_setosa_sd*sqrt(2*pi))*exp(-(x - sepal_width_setosa_mean)^2 / (2 * sepal_width_setosa_variance))
}
> curve(sepal_width_setosa_gaussian_distribution, from = 2.0, to = 4.5,
col = "darkblue", lwd = 2, add = TRUE)
And I'll do another visual test for normality:
> qqnorm(sepal_width_setosa, main = "Normal Q-Q Plot of Sepal Width for species setosa")
And a statistical test for normality
> library(nortest)
> ad.test(sepal_width_setosa)
Anderson-Darling normality test
data: sepal_width_setosa
A = 0.49096, p-value = 0.2102
Wow! The p-value is greater than 0.05, so I guess the variable isn't normally distributed.
Question: Is there a way to put this on a loop to get the histograms with its overlayed associated normal distribution for sepal width for all of the iris species, and then show the plots side by side? And how do I add standard deviation markers 1 to 3?
Upvotes: 2
Views: 918
Reputation: 3194
To answer your question about for loops and histograms, I wrote some example code of what I would normally do.
species_list <- split(iris, iris$Species) # split data.frame into lists by Species
par(mfrow = c(1,length(species_list))) # set the grid of the plot to be 1 row, 3 columns
for(i in 1:length(species_list)){ # using length(species_list) for the # of species
# subsetting lists with double brackets `[[`
hist(species_list[[i]]$Sepal.Width, probability = T,
main = "", xlab = "Sepal Width (cm)", ylab = "Probability", # `main` is empty
col = c("cyan","green3","yellow")[i]) # picks a new color each time (here out of 3)
# my favorite function for plot titles
mtext(toupper(names(species_list)[i]), # upper case species name for title
side = 3, # 1 is bottom, 2 is left, 3 is top, 4 is right
line = 0, # do not shift up or down
font = 2) # 2 is bold
# Simpler than building from scratch use the built in function:
curve(dnorm(x, mean = mean(species_list[[i]]$Sepal.Width),
sd = sd(species_list[[i]]$Sepal.Width)),
yaxt = "n", add = TRUE, col = "darkblue", lwd = 2)
}
# At the end post a new title
mtext("Histograms of Sepal Width by Species Overlayed with a Normal Distribution",
outer = TRUE, # this puts it outside of the 3 plots
side = 3, line = -1, # shift it down so we can see it
font = 2)
par(mfrow = c(1,1)) # set the plot parameters back to a single plot when finished
Upvotes: 3