Calculate the trend for 2000-rows time series and isolate the abnormal rows

Question

I have a R dataframe which describes the evolution of the sales of a product in approx. 2000 shops in a quarterly basis, with 5 columns (ie. 5 periods of time). I'd like to know how to analyse it with R.

I've already tried to make some basic analysis, that is to say to determine the average sales for the 1st period, the 2nd period, etc. and then determine the average for each period and then to compare the evolution of each shop relatively to this general evolution. For instance, there is a total of 50 000 sales for the 1st period and 35 000 for the 5th, so I assume that for each shop the normal sale in the 5th period is to be 35/55=0.63*the amount of the 1st period's sale: if the shop X has sold 100 items in the first period, I assume that it should normally sell 63 items in the 5th period.

Obviously, this is an easy-to-do method, but it is not statistically relevant.

I would like a method which would enable me to determine a trend curb which miminizes my R-square. My objective is to be able to analyse the sales of the shops by neutralizing the general trend: I'd like to know precisely what are the underperforming shops and what are the overperforming shops, with a statistically correct approach.

My dataframe is structured in this way :

shopID | sum | qt1 | qt2 | qt3 | qt4 | qt5
000001 | 150 | 45  | 15  | 40  | 25  | 25
000002 | 100 | 20  | 20  | 20  | 20  | 20
000003 | 500 | 200 | 0   | 100 | 100 | 100
... (2200 rows)

I've tried to put my timeserie in a list, which is successful, with the following functon:

reversesales=t(data.frame(sales$qt1,sales$qt2,sales$qt3,sales$qt4,sales$qt5))
# I reverse rows and columns of the frame in order that the time periods be the rows
timeser<-ts(reversesales,start=1,end=5, deltat=1/4)
# deltat=1/4 because it is a quarterly basis, 1 and 5 because I have 5 quarters

Still, I am unable to do anything with this variable. I can't do any plot (with the "plot" function) as there are 2200 rows (and so R wants to make me 2200 successive plots, obviously this is not what I want).

In addition, I don't know how to determine the theoretical trend and the theoretical value of the sales for each period for each shop...

Thank you for your help! (and merry Christmas)

DatamineR · Accepted Answer

An implementation of mixed model:

install.packages("nlme")
library("nlme")
library(dplyr)

# Generating some data with a structure like yours:
start <- round(sample(10:100, 50, replace = TRUE)*runif(50))
df <- data_frame(shopID = 1:50, qt1 = start, qt2 =round(qt1*runif(50, .5, 2)) ,qt3 = round(qt2*runif(50, .5, 2)), qt4 = round(qt3*runif(50, .5, 2)), qt5 = round(qt4*runif(50, .5, 2)))
df <- as.data.frame(df)

# Converting in into the long format:
df <- reshape(df, idvar = "shopID", varying = names(df)[-1], direction = "long", sep = "")

Estimating the model:
mod <- lme(qt ~ time, random = ~ time | shopID, data = df)

# Extract the random effects for comparison:
random.effects(mod)
(Intercept)        time
1   74.0790805   3.7034172
2    7.8713699   4.2138001
3   -8.0670810  -5.8754060
4  -16.5114428  16.4920663
5  -16.7098229   6.4685228
6  -11.9630688  -8.0411504
7  -12.9669777  21.3071366
8  -24.1099280  32.9274361
9    8.5107335  -9.7976905
10 -13.2707679  -6.6028927
11   3.6206163  -4.1017784
12  21.2342886  -6.7120725
13 -14.6489512  11.6847109
14 -14.7291647   2.1365768
15  10.6791941   3.2097199
16 -14.1524187  -1.6933291
17   5.2120647   8.0119320
18  -2.5172933  -6.5011416
19  -9.0094366  -5.6031271
20   1.4857512  -5.9913865
21 -16.5973442   3.5164298
22 -26.7724763  27.9264081
23  49.0764631 -12.9800871
24  -0.1512509   2.3589947
25  15.7723150  -7.9295698
26   2.1955489  11.0318875
27  -8.0890346  -5.4145977
28   0.1338790  -8.3551182
29   9.7113758  -9.5799588
30  -6.0257683  42.3140432
31 -15.7655545  -8.6226255
32  -4.1450984  18.7995079
33   4.1510104  -1.6384103
34   2.5107652  -2.0871890
35 -23.8640815   7.6680185
36 -10.8228653  -7.7370976
37 -14.1253093  -8.1738468
38  42.4114024  -9.0436585
39 -10.7453627   2.4590883
40 -12.0947901  -5.2763010
41  -7.6578305  -7.9630013
42 -14.9985612  -0.4848326
43 -13.4081771  -7.2655456
44 -11.5646620  -7.5365387
45   6.9116844 -10.5200339
46  70.7785492 -11.5522014
47  -7.3556367  -8.3946072
48  27.3830419  -6.9049164
49  14.3188079  -9.9334156
50 -15.2077850  -7.9161690

I would interpret the values as follows: consider them as a deviation from zero, so that positive values are positive deviations from the average, whereas negative values are negative deviation from the average. The averages of the two columns are zero, as is checked below:

round(apply(random.effects(mod), 2, mean))
(Intercept)        time 
0           0

Calculate the trend for 2000-rows time series and isolate the abnormal rows

Answers (2)

Related Questions