OverFlow Police

Reputation: 861

Use cut in R so that empty intervals are included

I have a data set like this:

sum_col   city    scen    model   time_period   chill_season
110.02     NY      RCP_8   bcc     2076_2099     season_2085_2086
91.26      NY      RCP_8   bcc     2076_2099     season_2086_2087
91.05      NY      RCP_8   bcc     2076_2099     season_2087_2088
74.96      NY      RCP_8   bcc     2076_2099     season_2088_2089
77.97      NY      RCP_8   bcc     2076_2099     season_2089_2090
109.05     NY      RCP_8   bcc     2076_2099     season_2090_2091

I want to cut the sum_col column and count how many times the values fall within each interval defined by bks = c(-300, seq(20, 75, 5), 300).
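For reference, here is a minimal sketch of what these breaks produce on the six values shown above. The object name thresh_range is only illustrative. It suggests that cut itself defines all 13 intervals as factor levels, whether or not any value falls in them:

```r
# The six sum_col values from the sample data above.
sum_col <- c(110.02, 91.26, 91.05, 74.96, 77.97, 109.05)
bks <- c(-300, seq(20, 75, 5), 300)

# cut() returns a factor; its levels cover every interval between the breaks.
thresh_range <- cut(sum_col, breaks = bks)

levels(thresh_range)   # all 13 intervals, from (-300,20] to (75,300]
table(thresh_range)    # counts per interval, with 0 for the empty ones
```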

However, when I try the following:

library(dplyr)
library(data.table)

result <- dt %>%
          mutate(thresh_range = cut(sum_col, breaks = bks)) %>%
          group_by(time_period, thresh_range, model, scen, city) %>%
          summarize(no_years = n_distinct(chill_season, na.rm = FALSE)) %>% 
          data.table()

my result looks like:

time_period   thresh_range  model   scen    city   no_years
  2076_2099      (70,75]      bcc   RCP_8     NY     1
  2076_2099     (75,300]      bcc   RCP_8     NY     5

So the intervals below 70, e.g. (20,25] and (25,30], do not appear in the result, because no row in the data falls within them.

Is there any way to make the result include those intervals with a count of zero?

Please note, again, that a row similar to the following:

 a_value_less_than_70_here  NY   RCP_8  bcc 2076_2099  chill_2076_2077

whose sum_col is less than 70, does not exist in the data. Still, I was wondering whether, for such non-existent data, cut can produce a 0 or NA that tells us the temperature of NY, with those parameters, did not fall in the (20,25] interval.

The bottom line is that I want to see how many years each city, for a given set of parameters (model, scen, etc.), falls within each interval: (20,25], (25,30], etc.

If any suggestion other than cut works, that is great as well.

Upvotes: 1

Views: 293

Answers (1)

Weihuang Wong

Reputation: 13108

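You can use the complete function from the tidyr package to create rows for the missing combinations of data. Because thresh_range is a factor, complete expands it to all of its levels, including intervals that contain no data. The added rows have NA in chill_season, so n_distinct(chill_season, na.rm = TRUE) returns 0 for them (note that na.rm is now TRUE):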

library(dplyr)
library(tidyr)
result <- dt %>%
          mutate(thresh_range = cut(sum_col, breaks = bks)) %>%
          complete(time_period, thresh_range, model, scen, city) %>%
          group_by(time_period, thresh_range, model, scen, city) %>%
          summarize(no_years = n_distinct(chill_season, na.rm = TRUE)) 
result
# # A tibble: 13 x 6
# # Groups:   time_period, thresh_range, model, scen [?]
#    time_period thresh_range model scen  city  no_years
#    <chr>       <fct>        <chr> <chr> <chr>    <int>
#  1 2076_2099   (-300,20]    bcc   RCP_8 NY           0
#  2 2076_2099   (20,25]      bcc   RCP_8 NY           0
#  3 2076_2099   (25,30]      bcc   RCP_8 NY           0
#  4 2076_2099   (30,35]      bcc   RCP_8 NY           0
#  5 2076_2099   (35,40]      bcc   RCP_8 NY           0
#  6 2076_2099   (40,45]      bcc   RCP_8 NY           0
#  7 2076_2099   (45,50]      bcc   RCP_8 NY           0
#  8 2076_2099   (50,55]      bcc   RCP_8 NY           0
#  9 2076_2099   (55,60]      bcc   RCP_8 NY           0
# 10 2076_2099   (60,65]      bcc   RCP_8 NY           0
# 11 2076_2099   (65,70]      bcc   RCP_8 NY           0
# 12 2076_2099   (70,75]      bcc   RCP_8 NY           1
# 13 2076_2099   (75,300]     bcc   RCP_8 NY           5
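
Since the question also asks for alternatives, here is a minimal base R sketch of the same idea using table, which cross-tabulates over every factor level and so reports 0 for empty intervals. It rebuilds the six sample rows by hand and assumes each chill_season appears in only one row, so that counting rows is the same as counting years. The object name counts is illustrative:

```r
# Sample rows from the question (one row per chill season).
dt <- data.frame(
  sum_col      = c(110.02, 91.26, 91.05, 74.96, 77.97, 109.05),
  city         = "NY",
  scen         = "RCP_8",
  model        = "bcc",
  time_period  = "2076_2099",
  chill_season = paste0("season_", 2085:2090, "_", 2086:2091)
)
bks <- c(-300, seq(20, 75, 5), 300)

# cut() returns a factor carrying all 13 intervals as levels, used or not.
dt$thresh_range <- cut(dt$sum_col, breaks = bks)

# table() counts rows for every combination of levels, including empty ones.
# Assumption: rows are unique per chill_season, so row counts equal year counts.
counts <- as.data.frame(
  table(dt[, c("time_period", "thresh_range", "model", "scen", "city")]),
  responseName = "no_years"
)
counts
```

Like complete with all columns listed, this produces the full cross product of the grouping columns, so with several cities or models it would also list combinations that never occur in the data.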

Upvotes: 2
