Reputation: 625
I have created a barplot of Age vs. Population size (by gender) from Census data in ggplot2. Similarly, I have used the 'fitdist' function from the fitdistrplus package to derive Weibull parameters for the normalised (by maximum observed population across all Age bins) population data.
What I would like to do is to overlay the plotted data with the distribution as a line plot. I have tried
+ geom_line (denscomp(malefit.w))
Plus other numerous (unsuccessful) strategies.
Any help that could be provided would be much appreciated! Please find the syntax appended below:
Data Structure
Order Age Male Female Total male.norm
1 1 0 - 5 2870000 2820000 5690000 1.00000000
2 2 5 - 9 2430000 2390000 4820000 0.84668990
3 3 10 - 14 2340000 2250000 4590000 0.81533101
4 4 15 - 19 2500000 2500000 5000000 0.87108014
5 5 20 - 24 2690000 2680000 5370000 0.93728223
6 6 25 - 29 2540000 2520000 5060000 0.88501742
7 7 30 - 34 2040000 1990000 4030000 0.71080139
8 8 35 - 39 1710000 1760000 3470000 0.59581882
9 9 40 - 44 1400000 1550000 2950000 0.48780488
10 10 45 - 49 1200000 1420000 2620000 0.41811847
11 11 50 - 54 1010000 1210000 2220000 0.35191638
12 12 55 - 59 812000 985000 1800000 0.28292683
13 13 60 - 64 612000 773000 1390000 0.21324042
14 14 65 - 69 402000 556000 958000 0.14006969
15 15 70 - 74 293000 455000 748000 0.10209059
16 16 75 - 79 165000 316000 481000 0.05749129
17 17 80 - 84 101000 222000 323000 0.03519164
18 18 85 plus 75500 180000 256000 0.02630662
female.norm
1 1.00000000
2 0.84751773
3 0.79787234
4 0.88652482
5 0.95035461
6 0.89361702
7 0.70567376
8 0.62411348
9 0.54964539
10 0.50354610
11 0.42907801
12 0.34929078
13 0.27411348
14 0.19716312
15 0.16134752
16 0.11205674
17 0.07872340
18 0.06382979
Upvotes: 1
Views: 2372
Reputation: 625
This is the answer to the original question I posed above. In conjunction with the data posted in the question it is a beginning to end solution (i.e. raw data to plot).
Fitting of South-African age-population data (by gender) to a Weibull distribution (Theresa Cain and Ben Small)
load libraries
library(MASS)
library(ggplot2)
Import dataset
age_gender2 <- read.csv("age_gender2.csv", sep=",", header = T)
Define total population size by gender - that is sum the entire male / female population across all age bins and place in an objects 'total.male' and 'total.female' respectively
total.male <- sum(age_gender2$Male)
total.female <- sum(age_gender2$Female)
The object 'age.groups' is a single row, single column vector describing the number of age bins for the 'age_gender2' df
age.groups <- length(age_gender2$Age)
The object 'age.all' is a 1 row, 18 column empty matrix that will describe the minimum age range extracted from the age bins (categories) in the 'Age' column from age_gender2 df
age.all <- matrix(0,1,age.groups)
Next line assigns min age to each element of matrix (1 X 18) for first column in each age group. So 'for' loop assigns each column of matrix as an age (HELP: writing a for loop in R).
Structure of the 'for' loop # RULE (given in parentheses()): for each element (i) loop from 2 to the value presented in the 'age.groups' object (i.e. 18) # COMMAND (given in curly brackets {}): taking each element (i) in the 'age.male' matrix and starting at the first row (i.e. [1, by each element (i.e. [1,i], perform / assign ('<-') the following operation: ((5 X (ith element - 1)) - 2.5). This operation provides the 'middle' age for the bin
this assigns the first element (row, column) in the 'age.all' matrix the value 2.5
age.all[1,1] <- 2.5
for(i in 2:age.groups){
age.all[1,i] <- ((5*(i)) - 2.5)
}
This next command 'rep' creates a (1 X 25190500) vector of all the ages within a particular bin
male.data <- rep(age.all,age_gender2$Male)
female.data <- rep(age.all,age_gender2$Female)
Fit weibull distribution to age for male and female
male.weib <- fitdistr(male.data, "weibull")
female.weib <- fitdistr(female.data, "weibull")
male.shape <- male.weib$estimate[1]
male.scale <- male.weib$estimate[2]
female.shape <- female.weib$estimate[1]
female.scale <- female.weib$estimate[2]
Add column "Age_Median" to 'age_gender2' df with median age. Need to transpose as 'age.all' is an 1 row X 18 column vector.
age_gender2["Age_Median"] <- t(age.all)
Fit weibull distribution
The function 'pweibull' is a PDF and finds the cumulative probability over all ages, therefore we need to subtract the previous age bin(s) from the present bin to find the probability for that bin and hence (by multiplying by the total male population) the expected population for that bin.
male.p.weibull <- matrix(0,1,age.groups)
female.p.weibull <- matrix(0,1,age.groups)
for (i in 1:age.groups){
male.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, male.shape, male.scale) - pweibull(age.all[1,i]-2.5, male.shape, male.scale)
}
for (i in 1:age.groups){
female.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, female.shape, female.scale) - pweibull(age.all[1,i]-2.5, female.shape, female.scale)
}
Add column to list calculated population per age bin - 'transpose' to 1 x 18 -> 18 row x 1 column vector
age_gender2["male.prob"] <- t(male.p.weibull * total.male)
age_gender2["female.prob"] <- t(female.p.weibull * total.female)
Create bar plots describing Age-Gender population distributions
Males (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)
agp.male <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Male, fill=Male)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Male Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$male.prob, group=1))
Females (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)
agp.female <- ggplot(age_gender2, aes(x=reorder(Age, Order), y=Female, fill=Female)) + geom_bar(stat="identity") + theme (axis.text.x=element_text(angle=45, hjust=1)) + xlab("Age Group (5 yr bin)") + ylab("Female Population (M)") + geom_smooth(aes(age_gender2$Age,age_gender2$female.prob, group=1))
Upvotes: 1