civy
civy

Reputation: 423

Sequentially calculate changes in R data frame value based on categorical variables

I have constructed a set of indicators for weather prediction and I would like to backtest my algorithm by using different sets of them in R. Although I had some experience with merging using data table I cannot figure out a way to do this particular task. Here's a portion of my data

Time TempC Index1 Index2 Index3 Index4
1    25.0  Up     Down   Down   Down
2    25.2  Up     Up     Up     Down
3    25.6  Up     Up     Up     Up
4    25.0  Down   Down   Down   Up
5    24.7  Down   Down   Down   Down
6    23.6  Down   Down   Down   Down
7    28.9  Up     Up     Up     Down
8    30.0  Up     Up     Up     Up

Ideally I would pick a set of indicators let's say Index1 and Index2 and calculate the differences based on the first and last instance of Up and Down signals. For example when taking Index1 as the only predictor;

Time TempC       TempDiff 
1,3  25.0, 25.6  0.6
4,6  25.0, 23.6  1.4
7,8  28.9, 30.0  1.1

And when taking both Index1 and Index2;

Time TempC       TempDiff 
2,3  25.2, 25.6  0.4
4,6  25.0, 23.6  1.4
7,8  28.9, 30.0  1.1

What matters to me the most is the difference. Concatenating start and end temperature and time scales with commas is completely optional. Any help would be greatly appreciated.

edit

I would be interested to also know if there's any way to ignore intermediate sets of values that do not meet the criteria in the case of two or more indices like the following;

Time TempC Index1 Index2 Index3 Index4
1    25.0  Up     Down   Down   Down
2    25.2  Up     Up     Up     Down
3    25.6  Up     Up     Up     Up
4    25.0  Down   Down   Down   Up
5    24.7  Down   Down   Down   Down
6    23.6  Down   Down   Down   Down
7    28.9  Up     Up     Up     Down
8    29.3  Up     Down   Down   Down
9    30.0  Up     Up     Up     Up

The desired result above when using both Index1 and Index2 would be to ignore the 8th line and normally calculate the difference with lines 7 and 9.

Finally, it would be useful to add another column indicating the direction (Up or Down) which will let me quantify the reliability of the model like this

Time TempC       TempDiff  Direction
2,3  25.2, 25.6  0.4       Up
4,6  25.0, 23.6  -1.4      Down
7,8  28.9, 30.0  1.1       Up

Upvotes: 0

Views: 58

Answers (1)

www
www

Reputation: 39154

A solution using and .

library(dplyr)
library(data.table)

dat2 <- dat %>%
  # Create Run Length ID
  mutate(ID = rleid(Index1)) %>%
  group_by(ID) %>%
  # Filter groups with n > 1
  filter(n() > 1) %>%
  # Summarise the data by first and last value of each group
  summarise(Time = paste(first(Time), last(Time), sep = ", "),
            TempRange = paste(first(TempC), last(TempC), sep = ", "),
            TempDiff = abs(first(TempC) - last(TempC))) %>%
  ungroup() %>%
  select(-ID)

dat2
# # A tibble: 3 x 3
#  Time  TempRange TempDiff
#  <chr> <chr>        <dbl>
# 1 1, 3  25, 25.6     0.600
# 2 4, 6  25, 23.6     1.40 
# 3 7, 8  28.9, 30     1.10 

If you want to consider more than one columns, you can provide multiple columns to the rleid function.

dat3 <- dat %>%
  # Create Run Length ID
  mutate(ID = rleid(Index1, Index2)) %>%
  group_by(ID) %>%
  # Filter groups with n > 1
  filter(n() > 1) %>%
  # Summarise the data by first and last value of each group
  summarise(Time = paste(first(Time), last(Time), sep = ", "),
            TempRange = paste(first(TempC), last(TempC), sep = ", "),
            TempDiff = abs(first(TempC) - last(TempC))) %>%
  ungroup() %>%
  select(-ID)

dat3

# # A tibble: 3 x 3
#   Time  TempRange  TempDiff
#   <chr> <chr>         <dbl>
# 1 2, 3  25.2, 25.6    0.400
# 2 4, 6  25, 23.6      1.40 
# 3 7, 8  28.9, 30      1.10

Upvotes: 3

Related Questions