Lenn

Reputation: 1489

Is it a good idea to use a function with an internal for-loop inside dplyr's mutate?

I have a function that is meant to operate on a vector of numbers, e.g. a vector of temperatures. I want to detect heatwaves (in a very simplified way...). Let's say a heatwave starts with three consecutive days above 30 °C.

So I need to keep track of how long the current heatwave already is. I wrote a function that uses a for-loop internally. In pseudo-code it looks roughly like this:

is_heatwave = function(vals){
  
  length_heatwave = 0
  
  # returns a vector with the length of the input vals
  day_in_heatwave = vector(length=length(vals))
  days_in_current_heatwave =c()
  
  for(i in seq_along(vals)){
    val = vals[[i]]
    
    if(val > 30){
      length_heatwave = length_heatwave + 1
      days_in_current_heatwave = c(days_in_current_heatwave, i)
    }else{
      length_heatwave = 0
    }
    
    # ... some more code
  }
  
  return(day_in_heatwave)
    
}

This code might be wrong, but the idea is that the function takes a vector with as many elements as the data.frame has rows and returns a vector of the same length.

My idea is to have a function that I can use like this:
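For reference, the loop sketched above could be completed roughly as follows. This is a minimal sketch, not the original function: the argument names threshold and min_days and the choice to flag every day of a qualifying streak (including its first days) are assumptions.

```r
# Minimal sketch: flag every day that belongs to a run of at least
# `min_days` consecutive days above `threshold` (both names are assumptions).
is_heatwave <- function(vals, threshold = 30, min_days = 3) {
  in_heatwave <- logical(length(vals))  # result, same length as the input
  run_length <- 0                       # length of the current hot streak

  for (i in seq_along(vals)) {
    if (!is.na(vals[[i]]) && vals[[i]] > threshold) {
      run_length <- run_length + 1
      if (run_length >= min_days) {
        # mark the whole streak so far, including its first days
        in_heatwave[(i - run_length + 1):i] <- TRUE
      }
    } else {
      # a cooler (or missing) day ends the streak
      run_length <- 0
    }
  }
  in_heatwave
}

is_heatwave(c(31, 32, 33, 29, 31))
```

Because the function takes one vector and returns a vector of the same length, it fits the contract that mutate() expects for a new column.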

df = data.frame(
  temps = c(30,30,32,30,24)
)

df %>% mutate(is_heatwave = is_heatwave(temps))

I just wanted to ask if this generally is a good idea or are there any better ideas?

Upvotes: 0

Views: 68

Answers (3)

Roman

Reputation: 17648

You can try

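library(dplyr)

set.seed(1)  # arbitrary seed; the printed output below may come from a different draw
df = data.frame(
  temps = sample(25:40, 100, replace = TRUE)
)
df %>% 
  mutate(heatwave_length = cumsum(temps >= 30) - cummax((temps < 30) * cumsum(temps >= 30))) %>% 
  as_tibble()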
# A tibble: 100 × 2
   temps heatwave_length
   <int>           <int>
 1    33               1
 2    38               2
 3    30               3
 4    35               4
 5    30               5
 6    37               6
 7    35               7
 8    35               8
 9    38               9
10    29               0

The maximum length of each heatwave can then be picked out with something like:

mutate(max = ifelse(lead(heatwave_length) == 0, heatwave_length, NA)) 
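To see why the cumsum()/cummax() expression yields the running streak length, it can help to split it into steps on a small vector. The sketch below uses made-up temperatures and hypothetical intermediate names (total_hot, hot_before_run); it assumes there are no missing values.

```r
# Hypothetical toy data, not taken from the answer
temps <- c(31, 32, 29, 30, 33, 35, 28)
hot <- temps >= 30

# running count of all hot days seen so far
total_hot <- cumsum(hot)

# on each cool day, remember the running count; cummax() carries that
# value forward until the next cool day
hot_before_run <- cummax((!hot) * total_hot)

# hot days seen so far minus hot days seen before the current streak
heatwave_length <- total_hot - hot_before_run
heatwave_length
```

Every cool day resets the difference to 0, and each following hot day adds 1, which is exactly what the loop in the question keeps track of by hand.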

Upvotes: 2

Adriano Mello

Reputation: 2132

There are already good answers, so let's add some nuance.

This solution gives a unique streak_id to each streak, which may or may not be a heat wave (see the heat_wave column). hot_days_acc is the number of hot days accumulated within a streak.

The code:

library(dplyr)  # dplyr >= 1.1.0 is needed for consecutive_id() and .by

# -------------------     
# Number of days in a heat wave
heat_wave_days <- 3

# Temperature threshold 
hot_day <- 30

# Some toy data
set.seed(100)
aux_df <- tibble(temp = sample(-2:2 + hot_day, 50, replace = TRUE))

#
aux_df <- aux_df %>% 
  mutate(
    hot_days_acc = temp >= hot_day,
    streak_id = consecutive_id(hot_days_acc)) %>% 
  
  add_count(streak_id, name = "heat_wave") %>% 

  mutate(
    .by = streak_id, 
    heat_wave = all(hot_days_acc) & heat_wave >= heat_wave_days) %>% 
  
  mutate(streak_id = consecutive_id(heat_wave)) %>% 
  mutate(.by = streak_id, hot_days_acc = cumsum(hot_days_acc)) %>% 
  
  relocate(temp, streak_id, heat_wave, hot_days_acc)

The output:

> print(aux_df, n = nrow(aux_df))
# A tibble: 50 × 4
    temp streak_id heat_wave hot_days_acc
   <dbl>     <int> <lgl>            <int>
 1    29         1 FALSE                0
 2    30         1 FALSE                1
 3    28         1 FALSE                1
 4    29         1 FALSE                1
 5    31         1 FALSE                2
 6    31         1 FALSE                3
 7    29         1 FALSE                3
 8    30         1 FALSE                4
 9    29         1 FALSE                4
10    32         2 TRUE                 1
11    31         2 TRUE                 2
12    30         2 TRUE                 3
13    30         2 TRUE                 4
14    29         3 FALSE                0
15    28         3 FALSE                0
16    29         3 FALSE                0
17    30         4 TRUE                 1
18    31         4 TRUE                 2
19    31         4 TRUE                 3
20    31         4 TRUE                 4
21    32         4 TRUE                 5
22    30         4 TRUE                 6
23    28         5 FALSE                0
24    30         5 FALSE                1
25    31         5 FALSE                2
26    29         5 FALSE                2
27    32         6 TRUE                 1
28    32         6 TRUE                 2
29    32         6 TRUE                 3
30    28         7 FALSE                0
31    32         8 TRUE                 1
32    31         8 TRUE                 2
33    30         8 TRUE                 3
34    28         9 FALSE                0
35    28         9 FALSE                0
36    28         9 FALSE                0
37    30         9 FALSE                1
38    28         9 FALSE                1
39    28         9 FALSE                1
40    31        10 TRUE                 1
41    30        10 TRUE                 2
42    32        10 TRUE                 3
43    30        10 TRUE                 4
44    31        10 TRUE                 5
45    30        10 TRUE                 6
46    30        10 TRUE                 7
47    30        10 TRUE                 8
48    31        10 TRUE                 9
49    30        10 TRUE                10
50    32        10 TRUE                11

Upvotes: 2

Captain Hat

Reputation: 3247

You can do this very concisely with dplyr::consecutive_id(), which creates a grouping variable that increments whenever another variable changes. By creating a variable that represents hot days, we can then create groups that correspond to waves of hot and cold. We can then count the number of days in a group or determine which day of a heatwave we are in:

library(dplyr)

df <- data.frame(temps = c(30, 30, 30, 29, 30, 29, 29, 30, 29, 30, 30))

df <- mutate(df,
             hot = temps >= 30, 
             wave = consecutive_id(hot)) |> 
  mutate(heatwave_length = sum(hot),
         wave_day = 1:n() |>
           replace(!hot, NA),
         .by = wave) |> 
  select(temps, heatwave_length, wave_day)

df
#>    temps heatwave_length wave_day
#> 1     30               3        1
#> 2     30               3        2
#> 3     30               3        3
#> 4     29               0       NA
#> 5     30               1        1
#> 6     29               0       NA
#> 7     29               0       NA
#> 8     30               1        1
#> 9     29               0       NA
#> 10    30               2        1
#> 11    30               2        2

Created on 2024-04-09 with reprex v2.1.0

How does consecutive_id() work?

Simple:

  • Create a 'lagged' version of x (i.e. x_lag[n] is equal to x[n-1])
  • Use this to evaluate whether each element of x differs from the previous element (i.e. x != x_lag) (default to TRUE for x[1])
  • Whenever this is TRUE, the value has changed. We can therefore take a cumulative sum of this new variable, which will increment by 1 every time the group changes.

Here it is in base R:

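base_consecutive_id <- function(x){
  len <- length(x)
  if (len == 0) return(integer(0))  # nothing to label

  # TRUE for the first element, then TRUE wherever x differs from its predecessor;
  # negative indexing stays correct for length-1 input, where 2:len would count down
  c(TRUE, x[-1] != x[-len]) |> 
    cumsum()
}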
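The same run-detection idea is also available in base R through rle(), which is a different technique from consecutive_id() but yields the same groups. The sketch below reuses the temperatures from this answer; the names runs, is_wave, in_heatwave and wave_id, and the three-day minimum taken from the question, are assumptions for illustration.

```r
temps <- c(30, 30, 30, 29, 30, 29, 29, 30, 29, 30, 30)

# rle() collapses the hot/cold indicator into runs with their lengths
runs <- rle(temps >= 30)

# a run counts as a heatwave if it is hot and lasts at least 3 days
is_wave <- runs$values & runs$lengths >= 3

# expand the per-run results back to one value per day
in_heatwave <- rep(is_wave, times = runs$lengths)
wave_id <- rep(seq_along(runs$lengths), times = runs$lengths)

data.frame(temps, wave_id, in_heatwave)
```

Since both results have one element per input value, the same expressions could be wrapped in a function and used inside mutate().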


Upvotes: 3
