user20203146
user20203146

Reputation: 498

Identify the Start Month and End Month Dynamically For All Multiple Columns Based on Data in R

I have a data frame which is a time series data but multiple items, their data starts from different dates.

I want to figure out a way to dynamically take the data from the month in which the first data point is visible and ignore the 0 values in the start of it and perform outlier cleansing . This is because if I fix the the start of the time frame and end date of time frame, the results are wrong.

I was planning on using a for loop and perform an outlier identification but the issue is I need to find the start date and end date.

The start date is for the cases where there is 0 for at least 3M before we see the first data point and then select the date of the first data point as the start year and month. The end case is for cases when the value is 0 value for 3M after the last data point and the last data point will be selected as the End year and end month. For cases where there is no 0 at the start or end, we can fix the dates.

structure(list(`Row Labels` = c("2019-01-01", "2019-02-01", "2019-03-01", 
"2019-04-01", "2019-05-01", "2019-06-01", "2019-07-01", "2019-08-01", 
"2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01", "2020-01-01", 
"2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", 
"2020-07-01", "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01", 
"2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", 
"2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01", 
"2021-10-01", "2021-11-01", "2021-12-01", "2022-01-01", "2022-02-01", 
"2022-03-01", "2022-04-01", "2022-05-01", "2022-06-01", "2022-07-01", 
"2022-08-01", "2022-09-01", "2022-10-01"), `XYZ|146` = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 18, 16, 16, 17, 12, 22, 6, 
7, 6, 0, 15, 0, 17, 17, 5, 19, 16, 7, 25, 19, 34, 26, 41, 50, 
29, 42, 20, 14, 16, 27, 10, 28, 21), `XYZ|666` = c(0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17, 45, 
9, 21, 33, 3, 8, 11, 16, 3, 17, 14, 59, 26, 35, 26, 15, 7, 4, 
4, 2, 7, 6, 2), `XYZ|574` = c(0, 0, 0, 0, 0, 0, 0, 0, 74, 179, 
464, 880, 324, 184, 90, 170, 140, 96, 78, 83, 83, 121, 245, 740, 
332, 123, 117, 138, 20, 42, 70, 70, 42, 103, 490, 641, 488, 245, 
142, 95, 63, 343, 57, 113, 100, 105), `XYZ|851` = c(0, 0, 0, 
0, 0, 0, 0, 0, 0, 206, 1814, 2324, 772, 1116, 1636, 1906, 957, 
829, 911, 786, 938, 1313, 2384, 1554, 1777, 1635, 1534, 1015, 
827, 982, 685, 767, 511, 239, 1850, 1301, 426, 261, 201, 33, 
0, 0, 0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -46L))

So can someone help me identify a method to identify the start date and end dates dynamically in each columns.

The code I have written is.

library(readxl)
library(dplyr)
library(forecast)
Book2 <- read_excel("C:/X/X/X- Y/X/Book5.xlsx")
View(Book2)
dput(Book2)

Dates <- Book2 %>%
  select(`Row Labels`)

for(i in 2:ncol(Book2))
{
  Start_Year = 
  Start_Month =
  End_Year = 
  End_Month = 
    
  Y <- ts(data = Book2[,i],
          frequency = 12,
          start = c(Start_Year,Start_Month),
          end = c(End_Year,End_Month))
  
  autoplot(tsclean(Y),series = "Clean", color = 'red', lwd=0.9) +
    autolayer(Y,series = "original",color = "grey",lwd=1)
}

Can someone help me to figure out how to set out the Start_Year, Start_Month and End_Year, End_Month dynamically based on the above mentioned logic.

Is this doable or is it too difficult?

Upvotes: 0

Views: 75

Answers (1)

r2evans
r2evans

Reputation: 160437

I believe that NAs are much easier to deal with (as far as auto-removal) than 0s, so let's do a rolling-window on the data to NA-ize where three or more are all 0s. (Also, since the rows to remove will vary between columns, there's no way to remove some rows from one column and keep them for another. This way, the frame never changes dims, so it retains its data.frame properties nicely.)

The biggest weakness with this is that it assumes that each row is a month; if you have gaps, you will need to adapt the width= argument to rollapply based on the time spans.

(There is no strict need to define this fun, you can use the rollapply directly as across(.., ~ zoo::rollapply(z, 3, ...)). The reason I defined fun was for terse code.)

fun <- function(z) zoo::rollapply(z, 3, align = "right", partial = TRUE, FUN = function(z) if (all(z %in% c(NA, 0))) z[length(z)][NA] else z[length(z)])

dplyr

library(dplyr)
quux  %>%
  mutate(across(-`Row Labels`, ~ fun(.))) %>%
  print(n=99)
# # A tibble: 46 × 5
#    `Row Labels` `XYZ|146` `XYZ|666` `XYZ|574` `XYZ|851`
#    <chr>            <dbl>     <dbl>     <dbl>     <dbl>
#  1 2019-01-01          NA        NA        NA        NA
#  2 2019-02-01          NA        NA        NA        NA
#  3 2019-03-01          NA        NA        NA        NA
#  4 2019-04-01          NA        NA        NA        NA
#  5 2019-05-01          NA        NA        NA        NA
#  6 2019-06-01          NA        NA        NA        NA
#  7 2019-07-01          NA        NA        NA        NA
#  8 2019-08-01          NA        NA        NA        NA
#  9 2019-09-01          NA        NA        74        NA
# 10 2019-10-01          NA        NA       179       206
# 11 2019-11-01          NA        NA       464      1814
# 12 2019-12-01          NA        NA       880      2324
# 13 2020-01-01          12        NA       324       772
# 14 2020-02-01          18        NA       184      1116
# 15 2020-03-01          16        NA        90      1636
# 16 2020-04-01          16        NA       170      1906
# 17 2020-05-01          17        NA       140       957
# 18 2020-06-01          12        NA        96       829
# 19 2020-07-01          22        NA        78       911
# 20 2020-08-01           6        NA        83       786
# 21 2020-09-01           7        NA        83       938
# 22 2020-10-01           6        NA       121      1313
# 23 2020-11-01           0        17       245      2384
# 24 2020-12-01          15        45       740      1554
# 25 2021-01-01           0         9       332      1777
# 26 2021-02-01          17        21       123      1635
# 27 2021-03-01          17        33       117      1534
# 28 2021-04-01           5         3       138      1015
# 29 2021-05-01          19         8        20       827
# 30 2021-06-01          16        11        42       982
# 31 2021-07-01           7        16        70       685
# 32 2021-08-01          25         3        70       767
# 33 2021-09-01          19        17        42       511
# 34 2021-10-01          34        14       103       239
# 35 2021-11-01          26        59       490      1850
# 36 2021-12-01          41        26       641      1301
# 37 2022-01-01          50        35       488       426
# 38 2022-02-01          29        26       245       261
# 39 2022-03-01          42        15       142       201
# 40 2022-04-01          20         7        95        33
# 41 2022-05-01          14         4        63         0
# 42 2022-06-01          16         4       343         0
# 43 2022-07-01          27         2        57        NA
# 44 2022-08-01          10         7       113        NA
# 45 2022-09-01          28         6       100        NA
# 46 2022-10-01          21         2       105        NA

base

quux[-1] <- lapply(quux[-1], fun)
quux
# # A tibble: 46 × 5
#    `Row Labels` `XYZ|146` `XYZ|666` `XYZ|574` `XYZ|851`
#    <chr>            <dbl>     <dbl>     <dbl>     <dbl>
#  1 2019-01-01          NA        NA        NA        NA
#  2 2019-02-01          NA        NA        NA        NA
#  3 2019-03-01          NA        NA        NA        NA
#  4 2019-04-01          NA        NA        NA        NA
#  5 2019-05-01          NA        NA        NA        NA
#  6 2019-06-01          NA        NA        NA        NA
#  7 2019-07-01          NA        NA        NA        NA
#  8 2019-08-01          NA        NA        NA        NA
#  9 2019-09-01          NA        NA        74        NA
# 10 2019-10-01          NA        NA       179       206
# # … with 36 more rows

(all output the same)

Upvotes: 2

Related Questions