Where am I going wrong in splitting time series?

Question

data<-c(10.0,11.1,12.3,13.2,14.8,15.6,16.7,17.5,18.9,19.7,20.7,21.1,22.6,23.5,24.9,25.1,26.3,27.8,28.8,29.6,30.2,31.6,32.1,33.7)
startDate <- '2013-01-01'
endDate <- '2013-01-01'


df <- ts(cbind(data, startDate, endDate))
df


################

smp_size <- 0.80
train_ind <- length(df) * smp_size

train_split <- seq(from = 1, to = train_ind)
test_split <- seq(from = train_ind +1, to = length(df))

train <- data[train_split]
test <- data[-test_split]

(c(train, test))

I have the above data and I am trying to split it chronologically, i.e., the first 80% as training and the remaining 20% as testing.

I keep getting weird results:

(c(train, test))
 [1] 10.0 11.1 12.3 13.2 14.8 15.6 16.7 17.5 18.9 19.7 20.7 21.1 22.6 23.5 24.9 25.1 26.3 27.8 28.8 29.6 30.2
[22] 31.6 32.1 33.7   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[43]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 10.0 11.1 12.3 13.2 14.8 15.6
[64] 16.7 17.5 18.9 19.7 20.7 21.1 22.6 23.5 24.9 25.1 26.3 27.8 28.8 29.6 30.2 31.6 32.1 33.7

Why are there NA values in the middle of the data?

Roman · Accepted Answer

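You should use nrow(df), not length(df), here. Because df is built with cbind(), it is a three-column matrix-like ts object, so length(df) counts every cell (24 rows x 3 columns = 72) rather than the 24 observations. Your train_split therefore runs from 1 to 57 (72 * 0.8 = 57.6), and indexing the 24-element data vector with positions 25 to 57 returns NA. Note also that test <- data[-test_split] uses negative indices, which drop elements instead of selecting them; since those positions are all past the end of data, nothing is dropped and test ends up as the full vector.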

data <- c(10.0, 11.1, 12.3, 13.2, 14.8, 15.6, 16.7, 17.5, 18.9,
          19.7, 20.7, 21.1, 22.6, 23.5, 24.9, 25.1, 26.3, 27.8, 
          28.8, 29.6, 30.2, 31.6, 32.1, 33.7)
startDate <- '2013-01-01'
endDate <- '2013-01-01'

df <- ts(cbind(data, startDate, endDate))

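# nrow(df) * .8 is 19.2, and 1:19.2 stops at 19, so rows 1-19 go to train
train <- df[1:(nrow(df) * .8), ]
# negative indices drop those same rows, leaving rows 20-24 for test
test <- df[-(1:(nrow(df) * .8)), ]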

> all.equal(df, ts(rbind(train, test)))
[1] TRUE
> length(df) 
[1] 72
> nrow(df)
[1] 24
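
If you only need the numeric series (the two date columns are constant here), one alternative is to split a univariate ts with window(). This is a minimal sketch, not the only way to do it: the names series and split_point are made up for illustration, and it assumes the default ts time index (start = 1, frequency = 1), so observation i sits at time i.

```r
data <- c(10.0, 11.1, 12.3, 13.2, 14.8, 15.6, 16.7, 17.5, 18.9,
          19.7, 20.7, 21.1, 22.6, 23.5, 24.9, 25.1, 26.3, 27.8,
          28.8, 29.6, 30.2, 31.6, 32.1, 33.7)

# Why the original attempt produced NAs: indexing past the end of a
# 24-element vector returns NA for every out-of-range position.
sum(is.na(data[1:57]))   # 33 NAs, one for each of positions 25..57

# `series` and `split_point` are illustrative names, not from the post.
series <- ts(data)       # univariate, so length() is the observation count

n <- length(series)                # 24
split_point <- floor(n * 0.8)      # 19; floor() avoids a fractional index

# window() subsets by time and keeps the ts attributes on both pieces.
train <- window(series, end = split_point)        # observations 1..19
test  <- window(series, start = split_point + 1)  # observations 20..24

# Sanity check: the two pieces together reproduce the original data.
stopifnot(length(train) + length(test) == n)
stopifnot(isTRUE(all.equal(c(as.numeric(train), as.numeric(test)), data)))
```

With floor() the split is explicit about rounding down, so train gets 19 observations and test gets the remaining 5.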
