Reputation: 11
I have a data set that has monthly "flows" over 68 years. I am trying to make a comparison of flow distributions by decade by making a plot that has a seasonal distribution on the x-axis and displays a mean value for each decade on the plot.
Upvotes: 0
Views: 467
Reputation: 66445
(Edit: changed from line graph to dodged bar plot, to better align with OP code.)
Here's an approach using dplyr, tidyr, and ggplot2 from tidyverse
.
library(tidyverse)
M %>%
group_by(Decade = floor(Year/10)*10) %>%
summarize_at(vars(Jan:Sep), mean) %>%
# This uses tidyr::pivot_longer to reshape the data longer, which gives us the
# ability to map decade to color.
pivot_longer(-Decade, names_to = "Month", values_to = "Avg") %>%
# This step to get the months to be an ordered factor in order of appearance,
# which is necessary to avoid the months showing up in alphabetical order.
mutate(Month = fct_inorder(Month)) %>%
# Alternatively, we could have aligned these thusly
# mutate(Month_order = match(Month, month.abb)) %>%
# mutate(Month = fct_reorder(Month, Month_order)) %>%
ggplot(aes(Month, Avg, fill = as.factor(Decade))) +
geom_col(position = position_dodge()) +
scale_fill_discrete(name = "Decade")
Upvotes: 1
Reputation: 486
Using your sample data, and the tidyverse
packages, the following code will calculate the average per decade and month:
library(tidyverse)
x <- "Year Jan Feb Mar Apr May Jun Jul Aug Sep
1948 29550 47330 64940 61140 20320 17540 37850 29250 17100
1949 45700 53200 37870 36310 39200 23040 31170 23640 19720
1950 16050 17950 27040 21610 15510 16090 12010 11360 14390
1951 14280 13210 16260 24280 13570 9547 9921 8129 7304
1952 19030 29250 58860 31780 19940 16930 9268 9862 9708
1953 24340 28020 31830 29700 44980 15630 22660 14190 13430
1954 34660 23260 24390 21500 13250 10860 10700 8188 6092
1955 14050 19430 12780 19330 12210 7892 12450 10920 6850
1956 7262 20800 27680 24110 13560 8594 10150 7721 10540
1957 14470 13350 22720 39860 23980 12630 10230 7008 8567"
d <- read_table(x) %>%
mutate(
decade = (Year %/% 10)*10 # add column for decade
) %>%
select(-Year) %>% # remove the year
pivot_longer( # convert to a 'tidy' (long) format
cols = Jan:Sep,
names_to = "month",
values_to = "count"
) %>%
mutate(
month = factor(month, levels = month.abb, ordered = TRUE) # make sure months are ordered
) %>%
group_by(decade, month) %>%
summarise(
mean = mean(count)
)
If you print that dataframe, you get:
> d
# A tibble: 18 x 3
# Groups: decade [2]
decade month mean
<dbl> <ord> <dbl>
1 1940 Jan 37625
2 1940 Feb 50265
3 1940 Mar 51405
4 1940 Apr 48725
5 1940 May 29760
6 1940 Jun 20290
7 1940 Jul 34510
8 1940 Aug 26445
9 1940 Sep 18410
10 1950 Jan 18018.
11 1950 Feb 20659.
12 1950 Mar 27695
13 1950 Apr 26521.
14 1950 May 19625
15 1950 Jun 12272.
16 1950 Jul 12174.
17 1950 Aug 9672.
18 1950 Sep 9610.
If you need it back in wide format:
d2 <- d %>%
pivot_wider(
id_cols = decade,
names_from = month,
values_from = mean
)
> d2
# A tibble: 2 x 10
# Groups: decade [2]
decade Jan Feb Mar Apr May Jun Jul Aug Sep
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1940 37625 50265 51405 48725 29760 20290 34510 26445 18410
2 1950 18018. 20659. 27695 26521. 19625 12272. 12174. 9672. 9610.
Upvotes: 1