Reputation: 423
I have constructed a set of indicators for weather prediction and I would like to backtest my algorithm by using different sets of them in R. Although I had some experience with merging using data table I cannot figure out a way to do this particular task. Here's a portion of my data
Time TempC Index1 Index2 Index3 Index4
1 25.0 Up Down Down Down
2 25.2 Up Up Up Down
3 25.6 Up Up Up Up
4 25.0 Down Down Down Up
5 24.7 Down Down Down Down
6 23.6 Down Down Down Down
7 28.9 Up Up Up Down
8 30.0 Up Up Up Up
Ideally I would pick a set of indicators let's say Index1 and Index2 and calculate the differences based on the first and last instance of Up and Down signals. For example when taking Index1 as the only predictor;
Time TempC TempDiff
1,3 25.0, 25.6 0.6
4,6 25.0, 23.6 1.4
7,8 28.9, 30.0 1.1
And when taking both Index1 and Index2;
Time TempC TempDiff
2,3 25.2, 25.6 0.4
4,6 25.0, 23.6 1.4
7,8 28.9, 30.0 1.1
What matters to me the most is the difference. Concatenating start and end temperature and time scales with commas is completely optional. Any help would be greatly appreciated.
edit
I would be interested to also know if there's any way to ignore intermediate sets of values that do not meet the criteria in the case of two or more indices like the following;
Time TempC Index1 Index2 Index3 Index4
1 25.0 Up Down Down Down
2 25.2 Up Up Up Down
3 25.6 Up Up Up Up
4 25.0 Down Down Down Up
5 24.7 Down Down Down Down
6 23.6 Down Down Down Down
7 28.9 Up Up Up Down
8 29.3 Up Down Down Down
9 30.0 Up Up Up Up
The desired result above when using both Index1 and Index2 would be to ignore the 8th line and normally calculate the difference with lines 7 and 9.
Finally, it would be useful to add another column indicating the direction (Up or Down) which will let me quantify the reliability of the model like this
Time TempC TempDiff Direction
2,3 25.2, 25.6 0.4 Up
4,6 25.0, 23.6 -1.4 Down
7,8 28.9, 30.0 1.1 Up
Upvotes: 0
Views: 58
Reputation: 39154
A solution using dplyr and data.table.
library(dplyr)
library(data.table)
dat2 <- dat %>%
# Create Run Length ID
mutate(ID = rleid(Index1)) %>%
group_by(ID) %>%
# Filter groups with n > 1
filter(n() > 1) %>%
# Summarise the data by first and last value of each group
summarise(Time = paste(first(Time), last(Time), sep = ", "),
TempRange = paste(first(TempC), last(TempC), sep = ", "),
TempDiff = abs(first(TempC) - last(TempC))) %>%
ungroup() %>%
select(-ID)
dat2
# # A tibble: 3 x 3
# Time TempRange TempDiff
# <chr> <chr> <dbl>
# 1 1, 3 25, 25.6 0.600
# 2 4, 6 25, 23.6 1.40
# 3 7, 8 28.9, 30 1.10
If you want to consider more than one columns, you can provide multiple columns to the rleid
function.
dat3 <- dat %>%
# Create Run Length ID
mutate(ID = rleid(Index1, Index2)) %>%
group_by(ID) %>%
# Filter groups with n > 1
filter(n() > 1) %>%
# Summarise the data by first and last value of each group
summarise(Time = paste(first(Time), last(Time), sep = ", "),
TempRange = paste(first(TempC), last(TempC), sep = ", "),
TempDiff = abs(first(TempC) - last(TempC))) %>%
ungroup() %>%
select(-ID)
dat3
# # A tibble: 3 x 3
# Time TempRange TempDiff
# <chr> <chr> <dbl>
# 1 2, 3 25.2, 25.6 0.400
# 2 4, 6 25, 23.6 1.40
# 3 7, 8 28.9, 30 1.10
Upvotes: 3