Reputation: 518
I have a data.table of 59101 observations and I want to make a line plot, not a barplot, of proportions of my exposure
column by the unique values in the month.year
column (i.e. Jan 2014,Feb 2014,...,March 2020).
library(data.table)
set.seed(1992)
DT <- data.table::data.table(ID=seq(1,59101),
exposure=as.factor(sample(letters[1:10],59101,replace = TRUE)),
index_date=sample(seq(as.Date('2014/01/01'), as.Date('2020/03/31'), by="day"),59101,replace = TRUE))
DT[,month.year:=as.factor(zoo::as.yearmon(index_date)),]
DT
> DT
ID exposure index_date month.year
1: 1 g 2017-11-26 Nov 2017
2: 2 i 2019-10-11 Oct 2019
3: 3 i 2015-02-27 Feb 2015
4: 4 f 2016-10-18 Oct 2016
5: 5 e 2019-06-06 Jun 2019
---
59097: 59097 e 2015-07-22 Jul 2015
59098: 59098 j 2017-09-04 Sep 2017
59099: 59099 a 2018-04-26 Apr 2018
59100: 59100 a 2019-12-02 Dec 2019
59101: 59101 g 2014-11-04 Nov 2014
I want my data to look like the image I attached to my question (seen below)
I've been able to produce plots with ggplot in the past but I found that I struggled most with prepping the data, so my code ends up being rather rudimentary and ad hoc; I would like to keep my code as simple and clean as possible.
I've seen that melting data is the best approach when it comes to plotting with ggplot but I haven't built the brain muscle memory to instinctly know how to 1) prep the data in that manner and 2) pass the data through ggplot syntax to create what I need.
If anyone knows or has advice of how to do this, it would be greatly appreciated.
Upvotes: 2
Views: 2004
Reputation: 11255
Since this includes the data.table tag, here's a way to prep your data:
DT[,
{ n = .N
.SD[, .(rel_freq = .N / n), by = exposure]},
by = month.year]
month.year exposure rel_freq
<fctr> <fctr> <num>
1: Nov 2017 g 0.10840108
2: Nov 2017 f 0.10027100
3: Nov 2017 d 0.10162602
4: Nov 2017 i 0.09485095
5: Nov 2017 e 0.11382114
---
746: Jul 2018 f 0.10506799
747: Jul 2018 c 0.10259580
748: Jul 2018 a 0.10754017
749: Jul 2018 b 0.10135970
750: Jul 2018 g 0.11248455
Then you have a few options for the ggplot2 call.
DT_relative = DT[, {n = .N; .SD[, .(rel_freq = .N / n), by = exposure]}, by = month.year]
ggplot(DT_relative, aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()
ggplot
callggplot(DT[, {n = .N; .SD[, .(rel_freq = .N / n), by = exposure]}, by = month.year],
aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()
[data.table
call and use ggplot(.SD)
DT[, {n = .N; .SD[, .(rel_freq = .N / n), by = exposure]}, by = month.year
][, ggplot(.SD, aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()]
magrittr
to use the pipelibrary(magrittr)
DT[,
{ n = .N
.SD[, .(rel_freq = .N / n), by = exposure]},
by = month.year]%>%
ggplot(., aes(x = month.year, y = rel_freq, color = exposure, group = exposure)) + geom_line()
The other answer includes great formatting for the ggplot()
call so please see the other answer for how to make the graph look nice.
Upvotes: 1
Reputation: 1430
Here's a solution using tidyverse. It's a lot of data to plot, sometimes small multiples would be better.
library(data.table)
set.seed(1992)
DT <- data.table::data.table(ID=seq(1,59101),
exposure=as.factor(sample(letters[1:10],59101,replace = TRUE)),
index_date=sample(seq(as.Date('2014/01/01'), as.Date('2020/03/31'), by="day"),59101,replace = TRUE))
DT[,month.year:=as.factor(zoo::as.yearmon(index_date)),]
library(tidyverse)
DT %>% as_tibble() %>%
group_by(month.year, exposure) %>%
count() %>%
ungroup() %>%
group_by(month.year) %>%
mutate(ttl = sum(n),
pct_ttl = n / ttl,
date = lubridate::myd(month.year, truncated = 1L)) %>%
print(n = 20) %>%
ggplot(aes(date, pct_ttl, color = exposure, group = exposure)) +
geom_line() +
scale_x_date(scale_x_date(date_breaks = "4 months", date_labels = "%b %Y"))+
scale_y_continuous(label = scales::percent_format(accuracy = 1)) +
theme(axis.text.x = element_text(angle = 90))+
labs(y = "proportion", x = "")
#> # A tibble: 750 x 6
#> # Groups: month.year [75]
#> month.year exposure n ttl pct_ttl date
#> <fct> <fct> <int> <int> <dbl> <date>
#> 1 Jan 2014 a 66 793 0.0832 2014-01-01
#> 2 Jan 2014 b 83 793 0.105 2014-01-01
#> 3 Jan 2014 c 66 793 0.0832 2014-01-01
#> 4 Jan 2014 d 93 793 0.117 2014-01-01
#> 5 Jan 2014 e 76 793 0.0958 2014-01-01
#> 6 Jan 2014 f 71 793 0.0895 2014-01-01
#> 7 Jan 2014 g 87 793 0.110 2014-01-01
#> 8 Jan 2014 h 77 793 0.0971 2014-01-01
#> 9 Jan 2014 i 87 793 0.110 2014-01-01
#> 10 Jan 2014 j 87 793 0.110 2014-01-01
#> 11 Feb 2014 a 79 708 0.112 2014-02-01
#> 12 Feb 2014 b 66 708 0.0932 2014-02-01
#> 13 Feb 2014 c 69 708 0.0975 2014-02-01
#> 14 Feb 2014 d 69 708 0.0975 2014-02-01
#> 15 Feb 2014 e 69 708 0.0975 2014-02-01
#> 16 Feb 2014 f 78 708 0.110 2014-02-01
#> 17 Feb 2014 g 71 708 0.100 2014-02-01
#> 18 Feb 2014 h 67 708 0.0946 2014-02-01
#> 19 Feb 2014 i 65 708 0.0918 2014-02-01
#> 20 Feb 2014 j 75 708 0.106 2014-02-01
#> # ... with 730 more rows
Created on 2020-04-24 by the reprex package (v0.3.0)
Upvotes: 3