vojtam
vojtam

Reputation: 1225

new column based on two rows from consecutive days in R

I have a following problem. I computed average temperature per country and also a difference between the actual daily temperature and the average temperature. See code below:

df1 <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
                  date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02",  "2020-01-03", "2020-01-04"),
                  temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
                  blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12)
)


library(dplyr)

df2 <- df1 %>%
  group_by(country) %>%
  mutate(mean_per_country = mean(temperature))


df2$difference <- df2$temperature - df2$mean_per_country

Now I need to create a new column that checks if (unlimited number of) consecutive days in the same country have negative, or positive difference between the actual daily temperature and the average temperature. Is there an elegant way how can I do it in R?

Desired output is here:

desired_df <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
                         date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02",  "2020-01-03", "2020-01-04"),
                         temperature = c(4, 3, -2, 2, -3, 1.5, 12, 10, 7, 5),
                         blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12),
                         mean_per_country = c(0.42, 0.42, 0.42, 0.42, 0.42, 6.75, 6.75, 7.33, 7.33, 7.33),
                         difference = c(3.58, 2.58, -2.42 , -0.32, -3.42 , -5.25, 5.25, 2.67, -0.333, -2.33),
                         new_column = c("hot",
                                        "hot",
                                        "", #day interrupted, therefor not "cold"
                                        "", #day interrupted, therefor not "cold"
                                        "", #day interrupted, therefor not "cold"
                                        "", 
                                        "", 
                                        "", 
                                        "cold",
                                        "cold")
                         
)

Thank you very much

Upvotes: 1

Views: 64

Answers (2)

VitaminB16
VitaminB16

Reputation: 1234

You need to turn the dates to Date class and then you can calculate the differences between dates. Then group by country and use ifelse() to set the values if the differences are 1:

require(plyr)
require(dplyr)
df2$date = as.Date(df2$date)
diffs <- c(0,diff(df2$date))
df2 %>% group_by(country) %>%
        plyr::mutate(new_column = ifelse((difference > 0) & (diffs == 1), "hot", ifelse((difference < 0) & (diffs == 1), "cold", " ")))

> df2
   country       date temperature blabla mean_per_country difference new_column
1       01 2020-01-01         4.0     23         0.420000  3.5800000       
2       01 2020-01-02         3.0     41         0.420000  2.5800000        hot
3       01 2020-01-03        -2.0     32         0.420000 -2.4200000       cold
4       01 2020-01-05         0.1      8         0.420000 -0.3200000           
5       01 2020-01-07        -3.0     50         0.420000 -3.4200000           
6       02 2020-01-01         1.5     27         6.750000 -5.2500000           
7       02 2020-01-03        12.0      8         6.750000  5.2500000           
8       03 2020-01-02        10.0      7         7.333333  2.6666667           
9       03 2020-01-03         7.0      6         7.333333 -0.3333333       cold
10      03 2020-01-04         5.0     12         7.333333 -2.3333333       cold

Upvotes: 0

Ian Campbell
Ian Campbell

Reputation: 24770

Here's an approach with dplyr:

library(dplyr)
df2 %>%
   group_by(country) %>%
   mutate(date = as.Date(date),
          consecutive = date - lag(date) == 1,
          result = (sign(difference) == sign(lead(difference)) & lead(consecutive) |
                   (sign(difference) == sign(lag(difference)) & consecutive)),
          new_column = c("cold",NA_character_,"hot")[result * sign(difference) + 2])
# A tibble: 10 x 9
# Groups:   country [3]
   country date       temperature blabla mean_per_country difference consecutive result new_column
   <chr>   <date>           <dbl>  <dbl>            <dbl>      <dbl> <lgl>       <lgl>  <chr>     
 1 01      2020-01-01         4       23             0.42      3.58  NA          TRUE   hot       
 2 01      2020-01-02         3       41             0.42      2.58  TRUE        TRUE   hot       
 3 01      2020-01-03        -2       32             0.42     -2.42  TRUE        FALSE  NA        
 4 01      2020-01-05         0.1      8             0.42     -0.32  FALSE       FALSE  NA        
 5 01      2020-01-07        -3       50             0.42     -3.42  FALSE       NA     NA        
 6 02      2020-01-01         1.5     27             6.75     -5.25  NA          NA     NA        
 7 02      2020-01-03        12        8             6.75      5.25  FALSE       NA     NA        
 8 03      2020-01-02        10        7             7.33      2.67  NA          NA     NA        
 9 03      2020-01-03         7        6             7.33     -0.333 TRUE        TRUE   cold      
10 03      2020-01-04         5       12             7.33     -2.33  TRUE        TRUE   cold      

To get rid of the intermediate columns that I left there for illustration purposes, just user select(-(consecutive:result)).

Upvotes: 1

Related Questions