Reputation: 1225
I have a following problem. I computed average temperature per country and also a difference between the actual daily temperature and the average temperature. See code below:
df1 <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12)
)
library(dplyr)
df2 <- df1 %>%
group_by(country) %>%
mutate(mean_per_country = mean(temperature))
df2$difference <- df2$temperature - df2$mean_per_country
Now I need to create a new column that checks if (unlimited number of) consecutive days in the same country have negative, or positive difference between the actual daily temperature and the average temperature. Is there an elegant way how can I do it in R?
Desired output is here:
desired_df <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 2, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12),
mean_per_country = c(0.42, 0.42, 0.42, 0.42, 0.42, 6.75, 6.75, 7.33, 7.33, 7.33),
difference = c(3.58, 2.58, -2.42 , -0.32, -3.42 , -5.25, 5.25, 2.67, -0.333, -2.33),
new_column = c("hot",
"hot",
"", #day interrupted, therefor not "cold"
"", #day interrupted, therefor not "cold"
"", #day interrupted, therefor not "cold"
"",
"",
"",
"cold",
"cold")
)
Thank you very much
Upvotes: 1
Views: 64
Reputation: 1234
You need to turn the dates to Date
class and then you can calculate the differences between dates. Then group by country and use ifelse()
to set the values if the differences are 1:
require(plyr)
require(dplyr)
df2$date = as.Date(df2$date)
diffs <- c(0,diff(df2$date))
df2 %>% group_by(country) %>%
plyr::mutate(new_column = ifelse((difference > 0) & (diffs == 1), "hot", ifelse((difference < 0) & (diffs == 1), "cold", " ")))
> df2
country date temperature blabla mean_per_country difference new_column
1 01 2020-01-01 4.0 23 0.420000 3.5800000
2 01 2020-01-02 3.0 41 0.420000 2.5800000 hot
3 01 2020-01-03 -2.0 32 0.420000 -2.4200000 cold
4 01 2020-01-05 0.1 8 0.420000 -0.3200000
5 01 2020-01-07 -3.0 50 0.420000 -3.4200000
6 02 2020-01-01 1.5 27 6.750000 -5.2500000
7 02 2020-01-03 12.0 8 6.750000 5.2500000
8 03 2020-01-02 10.0 7 7.333333 2.6666667
9 03 2020-01-03 7.0 6 7.333333 -0.3333333 cold
10 03 2020-01-04 5.0 12 7.333333 -2.3333333 cold
Upvotes: 0
Reputation: 24770
Here's an approach with dplyr
:
library(dplyr)
df2 %>%
group_by(country) %>%
mutate(date = as.Date(date),
consecutive = date - lag(date) == 1,
result = (sign(difference) == sign(lead(difference)) & lead(consecutive) |
(sign(difference) == sign(lag(difference)) & consecutive)),
new_column = c("cold",NA_character_,"hot")[result * sign(difference) + 2])
# A tibble: 10 x 9
# Groups: country [3]
country date temperature blabla mean_per_country difference consecutive result new_column
<chr> <date> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <chr>
1 01 2020-01-01 4 23 0.42 3.58 NA TRUE hot
2 01 2020-01-02 3 41 0.42 2.58 TRUE TRUE hot
3 01 2020-01-03 -2 32 0.42 -2.42 TRUE FALSE NA
4 01 2020-01-05 0.1 8 0.42 -0.32 FALSE FALSE NA
5 01 2020-01-07 -3 50 0.42 -3.42 FALSE NA NA
6 02 2020-01-01 1.5 27 6.75 -5.25 NA NA NA
7 02 2020-01-03 12 8 6.75 5.25 FALSE NA NA
8 03 2020-01-02 10 7 7.33 2.67 NA NA NA
9 03 2020-01-03 7 6 7.33 -0.333 TRUE TRUE cold
10 03 2020-01-04 5 12 7.33 -2.33 TRUE TRUE cold
To get rid of the intermediate columns that I left there for illustration purposes, just user select(-(consecutive:result))
.
Upvotes: 1