Reputation: 498
I have a data frame which is a time series data but multiple items, their data starts from different dates.
I want to figure out a way to dynamically take the data from the month in which the first data point is visible and ignore the 0 values in the start of it and perform outlier cleansing . This is because if I fix the the start of the time frame and end date of time frame, the results are wrong.
I was planning on using a for loop and perform an outlier identification but the issue is I need to find the start date and end date.
The start date is for the cases where there is 0 for at least 3M before we see the first data point and then select the date of the first data point as the start year and month. The end case is for cases when the value is 0 value for 3M after the last data point and the last data point will be selected as the End year and end month. For cases where there is no 0 at the start or end, we can fix the dates.
structure(list(`Row Labels` = c("2019-01-01", "2019-02-01", "2019-03-01",
"2019-04-01", "2019-05-01", "2019-06-01", "2019-07-01", "2019-08-01",
"2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01", "2020-01-01",
"2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01",
"2020-07-01", "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01",
"2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01",
"2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01",
"2021-10-01", "2021-11-01", "2021-12-01", "2022-01-01", "2022-02-01",
"2022-03-01", "2022-04-01", "2022-05-01", "2022-06-01", "2022-07-01",
"2022-08-01", "2022-09-01", "2022-10-01"), `XYZ|146` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 18, 16, 16, 17, 12, 22, 6,
7, 6, 0, 15, 0, 17, 17, 5, 19, 16, 7, 25, 19, 34, 26, 41, 50,
29, 42, 20, 14, 16, 27, 10, 28, 21), `XYZ|666` = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17, 45,
9, 21, 33, 3, 8, 11, 16, 3, 17, 14, 59, 26, 35, 26, 15, 7, 4,
4, 2, 7, 6, 2), `XYZ|574` = c(0, 0, 0, 0, 0, 0, 0, 0, 74, 179,
464, 880, 324, 184, 90, 170, 140, 96, 78, 83, 83, 121, 245, 740,
332, 123, 117, 138, 20, 42, 70, 70, 42, 103, 490, 641, 488, 245,
142, 95, 63, 343, 57, 113, 100, 105), `XYZ|851` = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 206, 1814, 2324, 772, 1116, 1636, 1906, 957,
829, 911, 786, 938, 1313, 2384, 1554, 1777, 1635, 1534, 1015,
827, 982, 685, 767, 511, 239, 1850, 1301, 426, 261, 201, 33,
0, 0, 0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -46L))
So can someone help me identify a method to identify the start date and end dates dynamically in each columns.
The code I have written is.
library(readxl)
library(dplyr)
library(forecast)
Book2 <- read_excel("C:/X/X/X- Y/X/Book5.xlsx")
View(Book2)
dput(Book2)
Dates <- Book2 %>%
select(`Row Labels`)
for(i in 2:ncol(Book2))
{
Start_Year =
Start_Month =
End_Year =
End_Month =
Y <- ts(data = Book2[,i],
frequency = 12,
start = c(Start_Year,Start_Month),
end = c(End_Year,End_Month))
autoplot(tsclean(Y),series = "Clean", color = 'red', lwd=0.9) +
autolayer(Y,series = "original",color = "grey",lwd=1)
}
Can someone help me to figure out how to set out the Start_Year, Start_Month and End_Year, End_Month dynamically based on the above mentioned logic.
Is this doable or is it too difficult?
Upvotes: 0
Views: 75
Reputation: 160437
I believe that NA
s are much easier to deal with (as far as auto-removal) than 0
s, so let's do a rolling-window on the data to NA
-ize where three or more are all 0s. (Also, since the rows to remove will vary between columns, there's no way to remove some rows from one column and keep them for another. This way, the frame never changes dims, so it retains its data.frame
properties nicely.)
The biggest weakness with this is that it assumes that each row is a month; if you have gaps, you will need to adapt the width=
argument to rollapply
based on the time spans.
(There is no strict need to define this fun
, you can use the rollapply
directly as across(.., ~ zoo::rollapply(z, 3, ...))
. The reason I defined fun
was for terse code.)
fun <- function(z) zoo::rollapply(z, 3, align = "right", partial = TRUE, FUN = function(z) if (all(z %in% c(NA, 0))) z[length(z)][NA] else z[length(z)])
library(dplyr)
quux %>%
mutate(across(-`Row Labels`, ~ fun(.))) %>%
print(n=99)
# # A tibble: 46 × 5
# `Row Labels` `XYZ|146` `XYZ|666` `XYZ|574` `XYZ|851`
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-01-01 NA NA NA NA
# 2 2019-02-01 NA NA NA NA
# 3 2019-03-01 NA NA NA NA
# 4 2019-04-01 NA NA NA NA
# 5 2019-05-01 NA NA NA NA
# 6 2019-06-01 NA NA NA NA
# 7 2019-07-01 NA NA NA NA
# 8 2019-08-01 NA NA NA NA
# 9 2019-09-01 NA NA 74 NA
# 10 2019-10-01 NA NA 179 206
# 11 2019-11-01 NA NA 464 1814
# 12 2019-12-01 NA NA 880 2324
# 13 2020-01-01 12 NA 324 772
# 14 2020-02-01 18 NA 184 1116
# 15 2020-03-01 16 NA 90 1636
# 16 2020-04-01 16 NA 170 1906
# 17 2020-05-01 17 NA 140 957
# 18 2020-06-01 12 NA 96 829
# 19 2020-07-01 22 NA 78 911
# 20 2020-08-01 6 NA 83 786
# 21 2020-09-01 7 NA 83 938
# 22 2020-10-01 6 NA 121 1313
# 23 2020-11-01 0 17 245 2384
# 24 2020-12-01 15 45 740 1554
# 25 2021-01-01 0 9 332 1777
# 26 2021-02-01 17 21 123 1635
# 27 2021-03-01 17 33 117 1534
# 28 2021-04-01 5 3 138 1015
# 29 2021-05-01 19 8 20 827
# 30 2021-06-01 16 11 42 982
# 31 2021-07-01 7 16 70 685
# 32 2021-08-01 25 3 70 767
# 33 2021-09-01 19 17 42 511
# 34 2021-10-01 34 14 103 239
# 35 2021-11-01 26 59 490 1850
# 36 2021-12-01 41 26 641 1301
# 37 2022-01-01 50 35 488 426
# 38 2022-02-01 29 26 245 261
# 39 2022-03-01 42 15 142 201
# 40 2022-04-01 20 7 95 33
# 41 2022-05-01 14 4 63 0
# 42 2022-06-01 16 4 343 0
# 43 2022-07-01 27 2 57 NA
# 44 2022-08-01 10 7 113 NA
# 45 2022-09-01 28 6 100 NA
# 46 2022-10-01 21 2 105 NA
quux[-1] <- lapply(quux[-1], fun)
quux
# # A tibble: 46 × 5
# `Row Labels` `XYZ|146` `XYZ|666` `XYZ|574` `XYZ|851`
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-01-01 NA NA NA NA
# 2 2019-02-01 NA NA NA NA
# 3 2019-03-01 NA NA NA NA
# 4 2019-04-01 NA NA NA NA
# 5 2019-05-01 NA NA NA NA
# 6 2019-06-01 NA NA NA NA
# 7 2019-07-01 NA NA NA NA
# 8 2019-08-01 NA NA NA NA
# 9 2019-09-01 NA NA 74 NA
# 10 2019-10-01 NA NA 179 206
# # … with 36 more rows
(all output the same)
Upvotes: 2