xyz123
xyz123

Reputation: 651

How can I make this histogram of a dataset variable overplayed with it's normal distribution simpler and fancier?

Histogram with of Sepal Width for species setosa and its overlayed normal distribution function Okay, so I'm working with the classical iris dataset built into R. I am using the variable Sepal.Width within iris subsetted to the species "setosa". I plotted the probability histogram and it looks like it could follow a normal distribution (bell shaped curve), so I collected the mean of Sepal.Width and also the standard deviation of Sepal.Width, and that's all I need to plot the curve.

Normal/Gaussian Distribution formula:

f(x) = 1 / [σ * √(2 * π)] * e^[-(x - μ)^2 / (2 * σ^2)],

where e = Euler's number, π = pi, sigma (σ) = standard deviation, mu (μ) = mean.

So here is my code:

> hist(iris$Sepal.Width[iris$Species == "setosa"], probability = TRUE,
   main = "Histogram of Sepal Width for Setosa Species Overlayed with Normal Distribution", 
   xlab = "Sepal Width (cm)", ylab = "Probability", cex.main = 0.9)

> sepal_width_setosa <- iris$Sepal.Width[iris$Species == "setosa"]

> sepal_width_setosa_mean <- mean(sepal_width_setosa)

> sepal_width_setosa_mean
[1] 3.428

> sepal_width_setosa_sd <- sd(sepal_width_setosa)

> sepal_width_setosa_sd
[1] 0.3790644

> sepal_width_setosa_variance = var(sepal_width_setosa)

> range(sepal_width_setosa)
[1] 2.3 4.4

> sepal_width_setosa_gaussian_distribution <- function(x){
  1 / (sepal_width_setosa_sd*sqrt(2*pi))*exp(-(x - sepal_width_setosa_mean)^2 / (2 * sepal_width_setosa_variance))
}


> curve(sepal_width_setosa_gaussian_distribution, from = 2.0, to = 4.5,
   col = "darkblue", lwd = 2, add = TRUE)

And I'll do another visual test for normality:

> qqnorm(sepal_width_setosa, main = "Normal Q-Q Plot of Sepal Width for species setosa")

enter image description here

And a statistical test for normality

> library(nortest)
> ad.test(sepal_width_setosa)

Anderson-Darling normality test

data: sepal_width_setosa
A = 0.49096, p-value = 0.2102

Wow! The p-value is greater than 0.05, so I guess the variable isn't normally distributed.

Question: Is there a way to put this on a loop to get the histograms with its overlayed associated normal distribution for sepal width for all of the iris species, and then show the plots side by side? And how do I add standard deviation markers 1 to 3?

Upvotes: 2

Views: 918

Answers (1)

Evan Friedland
Evan Friedland

Reputation: 3194

To answer your question about for loops and histograms, I wrote some example code of what I would normally do.

species_list <- split(iris, iris$Species) # split data.frame into lists by Species

par(mfrow = c(1,length(species_list))) # set the grid of the plot to be 1 row, 3 columns 
for(i in 1:length(species_list)){ # using length(species_list) for the # of species
  # subsetting lists with double brackets `[[`
  hist(species_list[[i]]$Sepal.Width, probability = T, 
       main = "", xlab = "Sepal Width (cm)", ylab = "Probability", # `main` is empty
       col = c("cyan","green3","yellow")[i]) # picks a new color each time (here out of 3)
  # my favorite function for plot titles
  mtext(toupper(names(species_list)[i]), # upper case species name for title
        side = 3, # 1 is bottom, 2 is left, 3 is top, 4 is right
        line = 0, # do not shift up or down
        font = 2) # 2 is bold
  # Simpler than building from scratch use the built in function:
  curve(dnorm(x, mean = mean(species_list[[i]]$Sepal.Width), 
                 sd = sd(species_list[[i]]$Sepal.Width)), 
                 yaxt = "n", add = TRUE, col = "darkblue", lwd = 2)
}
# At the end post a new title
mtext("Histograms of Sepal Width by Species Overlayed with a Normal Distribution", 
      outer = TRUE, # this puts it outside of the 3 plots
      side = 3, line = -1, # shift it down so we can see it
      font = 2)
par(mfrow = c(1,1)) # set the plot parameters back to a single plot when finished

enter image description here

Upvotes: 3

Related Questions