Reputation: 63
I have a data frame with daily average temperature data in it, structured like so:
'data.frame': 4666 obs. of 6 variables:
$ Site : chr "EB" "FFCE" "IB" "FFCE" ...
$ Date : Date, format: "2013-01-01" "2013-01-01" "2013-01-01" "2014-01-01" ...
$ Day : int 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 2013 2013 2013 2014 2014 2014 2014 2015 2015 2015 ...
$ Month: int 1 1 1 1 1 1 1 1 1 1 ...
$ Temp : num 28.5 28.3 28.3 27 27.8 ...
i am attempting to produce a summary table which just totals the number of days in a year per site above certain temperatures thresholds e.g 25c, 26c. i can achieve this manually by using dplyr like so-
Days_above = Site_Daily_average %>%
group_by(Year, Site) %>%
summarise("23" = sum(Temp > 23), "24" = sum(Temp > 24),"25"= sum(Temp >
25), "26"= sum(Temp > 26), "27"= sum(Temp > 27), "28"= sum(Temp > 28), "29"
= sum(Temp > 29),"30"= sum(Temp > 30), "31" = sum(Temp > 31), "ABOVE
THRESHOLD" = sum(Temp > maxthreshold))%>% as.data.frame()
Which produces a table like so :
Year Site 23 24 25 26 27 28 29 30 31 ABOVE THRESHOLD
1 2012 EB 142 142 142 91 64 22 0 0 0 0
2 2012 FFCE 238 238 238 210 119 64 0 0 0 0
3 2012 IB 238 238 238 218 138 87 1 0 0 0
4 2013 EB 115 115 115 115 115 109 44 0 0 0
5 2013 FFCE 223 223 216 197 148 114 94 0 0 0
6 2013 IB 365 365 365 348 299 194 135 3 0 0
...
however, as you can see the code is fairly verbose. The problem i am having is producing this same output for a sequence of temperature thresholds, i.e Tempclasses = Seq(16,32,0.25).
As you can see that would take a long time to type that out manually. i feel like this is a very simple calculation and there should be way to use dplyr to recognize each variable in the sequence vector, perform this function and produce an output in a complete table format. sorry if that was unclear as i am relatively new to R, any suggestions would be welcome, thankyou.
Upvotes: 6
Views: 300
Reputation: 93871
Here's a tidyverse
approach, likewise using mtcars
for illustration:
library(tidyverse)
mtcars %>%
mutate(threshold = cut(mpg,
breaks=seq(10, max(mtcars$mpg)+10, 5),
labels=seq(10, max(mtcars$mpg)+5, 5))) %>%
group_by(cyl, threshold) %>%
tally %>%
ungroup %>%
complete(threshold, nesting(cyl), fill=list(n=0)) %>%
arrange(desc(threshold)) %>%
group_by(cyl) %>%
mutate(N_above = cumsum(n)) %>%
select(-n) %>%
arrange(cyl, threshold)
threshold cyl N_above 1 10 4 11 2 15 4 11 3 20 4 11 4 25 4 6 5 30 4 4 6 35 4 0 7 10 6 7 8 15 6 7 9 20 6 3 10 25 6 0 11 30 6 0 12 35 6 0 13 10 8 14 14 15 8 8 15 20 8 0 16 25 8 0 17 30 8 0 18 35 8 0
If you want the final data in wide format, add a spread
at the end and remove the arrange
:
... %>%
select(-n) %>%
spread(threshold, N_above)
cyl 10 15 20 25 30 35 1 4 11 11 11 6 4 0 2 6 7 7 3 0 0 0 3 8 14 8 0 0 0 0
Upvotes: 3
Reputation: 389265
As @dww commented we can use cut
to get the required format. I have tried this on mtcars
dataset where we create range from 10 to 35 with step of 5 for mpg
column.
df <- mtcars
df$group <- cut(df$mpg, seq(10, 35, 5))
and then we group by cyl
and use table
to get count of how many of them fall in the the respective buckets.
table(df$cyl, df$group)
# (10,15] (15,20] (20,25] (25,30] (30,35]
#4 0 0 5 2 4
#6 0 4 3 0 0
#8 6 8 0 0 0
Now , if certain value is greater than 10, it is also greater than 15, hence the number in (15, 20) bucket should also include number from (10,15) bucket and number in (20, 15) bucket should include both the previous number. Hence, we need a row-wise cumsum for this table
t(apply(table(df$cyl, df$group), 1, cumsum))
# (10,15] (15,20] (20,25] (25,30] (30,35]
# 4 0 0 5 7 11
# 6 0 4 7 7 7
# 8 6 14 14 14 14
For your case , the code would go
Site_Daily_average$group <- cut(Site_Daily_average$Temp, seq(16,32,0.25))
#and then do table to get required answer.
t(apply(table(Site_Daily_average$Year,Site_Daily_average$Site,
Site_Daily_average$group), 1, cumsum)
Upvotes: 1