Ben G Small
Ben G Small

Reputation: 625

Adding a fitted Weibull distribution (fitdistr) to a geom_bar (ggplot2) categorical plot

I have created a barplot of Age vs. Population size (by gender) from Census data in ggplot2. Similarly, I have used the 'fitdist' function from the fitdistrplus package to derive Weibull parameters for the normalised (by maximum observed population across all Age bins) population data.

What I would like to do is to overlay the plotted data with the distribution as a line plot. I have tried

+ geom_line (denscomp(malefit.w))

Plus other numerous (unsuccessful) strategies.

Any help that could be provided would be much appreciated! Please find the syntax appended below:

Data Structure

    Order     Age    Male  Female   Total  male.norm
1      1   0 - 5 2870000 2820000 5690000 1.00000000
2      2   5 - 9 2430000 2390000 4820000 0.84668990
3      3 10 - 14 2340000 2250000 4590000 0.81533101
4      4 15 - 19 2500000 2500000 5000000 0.87108014
5      5 20 - 24 2690000 2680000 5370000 0.93728223
6      6 25 - 29 2540000 2520000 5060000 0.88501742
7      7 30 - 34 2040000 1990000 4030000 0.71080139
8      8 35 - 39 1710000 1760000 3470000 0.59581882
9      9 40 - 44 1400000 1550000 2950000 0.48780488
10    10 45 - 49 1200000 1420000 2620000 0.41811847
11    11 50 - 54 1010000 1210000 2220000 0.35191638
12    12 55 - 59  812000  985000 1800000 0.28292683
13    13 60 - 64  612000  773000 1390000 0.21324042
14    14 65 - 69  402000  556000  958000 0.14006969
15    15 70 - 74  293000  455000  748000 0.10209059
16    16 75 - 79  165000  316000  481000 0.05749129
17    17 80 - 84  101000  222000  323000 0.03519164
18    18 85 plus   75500  180000  256000 0.02630662
   female.norm 
1   1.00000000  
2   0.84751773   
3   0.79787234    
4   0.88652482    
5   0.95035461    
6   0.89361702    
7   0.70567376    
8   0.62411348   
9   0.54964539    
10  0.50354610    
11  0.42907801    
12  0.34929078    
13  0.27411348   
14  0.19716312    
15  0.16134752   
16  0.11205674    
17  0.07872340   
18  0.06382979 

Upvotes: 1

Views: 2372

Answers (1)

Ben G Small
Ben G Small

Reputation: 625

This is the answer to the original question I posed above. In conjunction with the data posted in the question it is a beginning to end solution (i.e. raw data to plot).

Fitting of South-African age-population data (by gender) to a Weibull distribution (Theresa Cain and Ben Small)

load libraries

library(MASS)
library(ggplot2)  

Import dataset

age_gender2 <- read.csv("age_gender2.csv", sep=",", header = T)

Define total population size by gender - that is sum the entire male / female population across all age bins and place in an objects 'total.male' and 'total.female' respectively

total.male <- sum(age_gender2$Male)
total.female <- sum(age_gender2$Female)

The object 'age.groups' is a single row, single column vector describing the number of age bins for the 'age_gender2' df

age.groups <- length(age_gender2$Age) 

The object 'age.all' is a 1 row, 18 column empty matrix that will describe the minimum age range extracted from the age bins (categories) in the 'Age' column from age_gender2 df

age.all <- matrix(0,1,age.groups)

Next line assigns min age to each element of matrix (1 X 18) for first column in each age group. So 'for' loop assigns each column of matrix as an age (HELP: writing a for loop in R).

Structure of the 'for' loop # RULE (given in parentheses()): for each element (i) loop from 2 to the value presented in the 'age.groups' object (i.e. 18) # COMMAND (given in curly brackets {}): taking each element (i) in the 'age.male' matrix and starting at the first row (i.e. [1, by each element (i.e. [1,i], perform / assign ('<-') the following operation: ((5 X (ith element - 1)) - 2.5). This operation provides the 'middle' age for the bin

this assigns the first element (row, column) in the 'age.all' matrix the value 2.5

age.all[1,1] <- 2.5 

for(i in 2:age.groups){ 

age.all[1,i] <- ((5*(i)) - 2.5)  

}

This next command 'rep' creates a (1 X 25190500) vector of all the ages within a particular bin

male.data <- rep(age.all,age_gender2$Male) 
female.data <- rep(age.all,age_gender2$Female)

Fit weibull distribution to age for male and female

male.weib <- fitdistr(male.data, "weibull")
female.weib <- fitdistr(female.data, "weibull")


male.shape <- male.weib$estimate[1] 
male.scale <- male.weib$estimate[2] 

female.shape <- female.weib$estimate[1] 
female.scale <- female.weib$estimate[2] 

Add column "Age_Median" to 'age_gender2' df with median age. Need to transpose as 'age.all' is an 1 row X 18 column vector.

age_gender2["Age_Median"] <- t(age.all)

Fit weibull distribution

The function 'pweibull' is a PDF and finds the cumulative probability over all ages, therefore we need to subtract the previous age bin(s) from the present bin to find the probability for that bin and hence (by multiplying by the total male population) the expected population for that bin.

male.p.weibull <- matrix(0,1,age.groups)
female.p.weibull <- matrix(0,1,age.groups)

for (i in 1:age.groups){

male.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, male.shape, male.scale) -  pweibull(age.all[1,i]-2.5, male.shape, male.scale)

 }

for (i in 1:age.groups){

female.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, female.shape, female.scale) - pweibull(age.all[1,i]-2.5, female.shape, female.scale)

 }

Add column to list calculated population per age bin - 'transpose' to 1 x 18 -> 18 row x 1 column vector

age_gender2["male.prob"] <- t(male.p.weibull * total.male)
age_gender2["female.prob"] <- t(female.p.weibull * total.female)

Create bar plots describing Age-Gender population distributions

Males (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)

agp.male <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Male, fill=Male)) +   geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Male Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$male.prob, group=1))

Females (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)

agp.female <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Female, fill=Female)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Female Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$female.prob, group=1))

Upvotes: 1

Related Questions