Reputation: 51
I am using WHO's suicide statistics data that can be found here: https://www.kaggle.com/szamil/who-suicide-statistics . What I am trying to do is plot a line graph which will have years on the x axis and suicide rates on the y axis. As you will be able to see in the data it has suicide statistics for every country, age group and gender separately. What I want to do is plot a graph for one specific country, summarize the number of suicides from all age groups but have two different lines for females and males. Within my code I created a subset of the WHO data according to the user input (I am also creating a web app):
who_subset <- who[country, ]
where country is a reactive variable. What I want to get is this:
The code I am currently using is this:
library(ggplot2)
ggplot(data = who, aes(x = year, y = suicides_no)) +
geom_point() +
geom_line(aes(weights = suicides_no), stat = "identity")
I cannot upload the picture of the graph that I get when I run this but it is not continuous and it has several points for each year. It looks like a histogram in a way because it connects the points vertically (for one year) rather than having one point for each year and then connecting those points horizontally. Could anyone please guide me to plotting the graph that I want that would look like the one on the second picture? Any help is greatly appreciated.
Upvotes: 1
Views: 3014
Reputation: 4989
# For lack of a better source:
who <- read.csv("https://github.com/anudeike/who-suicide-stats/raw/master/data/who_suicide_statistics.csv", stringsAsFactors = FALSE)
who_uk <- subset(who, country == "United Kingdom")
Let's take a look at the data:
> str(who_uk)
'data.frame': 456 obs. of 6 variables:
$ country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
$ year : int 1979 1979 1979 1979 1979 1979 1979 1979 1979 1979 ...
$ sex : chr "female" "female" "female" "female" ...
$ age : chr "15-24 years" "25-34 years" "35-54 years" "5-14 years" ...
$ suicides_no: int 119 203 617 3 742 171 304 522 970 9 ...
$ population : int 4189200 3917300 6438700 4212200 6191200 2083600 4387000 3991400 6459700 4449000 ...
As the data is split into year, sex, and age, we need to transform / summarize it first. Doing that at runtime inside ggplot2 is not optimal. So, how do we do that? There are faster tools around, but wrangling data with dplyr is probably one of the most approachable methods. Let's take a stab:
library(dplyr)

# All suicides: sum counts and population over sex and age, per year
who_uk_all <- who_uk %>%
  group_by(year) %>%
  summarize(suicides_no = sum(suicides_no),
            population = sum(population)) %>%
  mutate(rate = 100000 * suicides_no / population)  # rate per 100,000

# By sex: sum counts and population over age, per year and sex
who_uk_sex <- who_uk %>%
  group_by(year, sex) %>%
  summarize(suicides_no = sum(suicides_no),
            population = sum(population)) %>%
  mutate(rate = 100000 * suicides_no / population)  # rate per 100,000
Let's plot it:
library(ggplot2)

# One line for the overall rate, plus one colored line per sex
ggplot() +
  geom_line(data = who_uk_all, aes(year, rate)) +
  geom_line(data = who_uk_sex, aes(year, rate, color = sex))
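If you would rather not depend on dplyr, the by-sex summary can also be sketched in base R with aggregate(). The data frame below is a made-up toy table (not WHO figures) with the same column names as who_uk, so treat it as an illustration under that assumption rather than a drop-in replacement:

```r
# Toy data, invented for illustration only (not WHO figures).
toy <- data.frame(
  year        = rep(c(2000L, 2001L), each = 4),
  sex         = rep(c("female", "female", "male", "male"), times = 2),
  age         = rep(c("15-24 years", "25-34 years"), times = 4),
  suicides_no = c(10L, 20L, 30L, 40L, 12L, 18L, 33L, 37L),
  population  = rep(1000000L, 8),
  stringsAsFactors = FALSE
)

# Sum counts and population over the age groups, per year and sex.
toy_sex <- aggregate(cbind(suicides_no, population) ~ year + sex,
                     data = toy, FUN = sum)

# Rate per 100,000 inhabitants.
toy_sex$rate <- 100000 * toy_sex$suicides_no / toy_sex$population

# Sort for readability.
toy_sex <- toy_sex[order(toy_sex$year, toy_sex$sex), ]
print(toy_sex)
```

Note that the formula interface of aggregate() drops rows containing NA by default (na.action = na.omit), whereas sum() without na.rm = TRUE returns NA, so the two approaches can differ if the data has missing values.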
Caveat: Both the source of the data in your image and the way it was transformed are probably different from the WHO data, so the plot differs slightly as well (also, the image shows England, not the UK). Furthermore, it seems really weird that the suicide rate for all persons is higher than both the male and the female suicide rate: a pooled rate should lie somewhere between the two. Further exploration is definitely warranted.
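As a starting point for that exploration, here is a small sanity check. The pooled rate is the population-weighted mean of the per-sex rates, so it cannot exceed the larger of the two. The numbers below are made up for illustration; if the same check fails on real output, the two summaries were probably computed from different inputs (missing values would be one thing to look at, though that is only a guess):

```r
# Made-up per-sex totals for a single year (illustrative only).
by_sex <- data.frame(
  sex         = c("female", "male"),
  suicides_no = c(30, 70),
  population  = c(2000000, 1000000)
)
by_sex$rate <- 100000 * by_sex$suicides_no / by_sex$population

# Pooled rate from the summed counts and summed population.
pooled_rate <- 100000 * sum(by_sex$suicides_no) / sum(by_sex$population)

# The same number, expressed as a population-weighted mean of the rates.
weighted <- weighted.mean(by_sex$rate, w = by_sex$population)

# The pooled rate must fall between the per-sex rates.
in_range <- pooled_rate >= min(by_sex$rate) && pooled_rate <= max(by_sex$rate)

print(c(pooled = pooled_rate, weighted = weighted))
print(in_range)
```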
Upvotes: 2