Reputation: 25547
I have multiple sets of time series data and would like help figuring out the best way to get them into R and analyze them. I'm pretty familiar with data.table but not so familiar with R's ts class for time series analysis.
In particular, I want to know how to use ts in this situation, or whether there are limitations in ts (such as problems aggregating a set of ts objects) that make it inappropriate to use here.
There are a large number of stores. For each store, I have multiple data points for each day, such as sales volume in dollars, sales volume in number of transactions, and store traffic (number of people entering the store). (Actually what I have is a table with columns store ID, date, and the data for that store and date.)
What I've been doing is using a data.table with one row per store, aggregating the data by store into months and storing the values for each month in a separate named column (e.g. jan14_dollars, feb14_dollars...) but this is unwieldy for a lot of reasons, in particular when I want to look at weeks or quarters.
I was thinking the right way to handle this was to have columns of type ts, so each row would just be store, dollars_ts, transactions_ts, traffic_ts. But (a) how do I get the data into that format, and (b) can ts objects be combined the way integers can to give me the results I want? If you can only answer (a) or (b) but not both, please do answer what you can.
I cannot provide a realistic data set, but you can generate a random one to play with like this:
require("data.table")
storeData <- CJ(store = toupper(letters), date = seq(as.Date('2012-01-01'), as.Date('2014-01-01'), by="day"))
storeData$dollars = sample(100:100000, nrow(storeData), replace = TRUE)/100
storeData$transactions <- sample(0:1000, nrow(storeData), replace = TRUE)
storeData$traffic <- storeData$transactions + sample(0:1000, nrow(storeData), replace = TRUE)
head(storeData)
store date dollars transactions traffic
1: A 2012-01-01 48.60 409 990
2: A 2012-01-02 996.89 36 428
3: A 2012-01-03 69.35 647 1103
4: A 2012-01-04 334.56 953 973
5: A 2012-01-05 692.99 958 1753
6: A 2012-01-06 973.32 724 1086
I want to answer questions like "how many stores had positive dollar sales growth?" and "is there a relationship between change in dollars/transaction and change in traffic?" and to bin the data into time periods and compare the answers across time periods (e.g. Q1 this year versus Q1 last year).
Can these kinds of questions be answered using ts? If so, how do I get this data into an appropriate set of columns, or is there some structure other than data.table I should be using?
Please show both how to organize the data and then how to use the data to answer the example questions "how many stores had positive dollar sales growth in January 2014 compared to January 2013?" and "what is the overall trend in dollars per transaction for the past 3 months?"
Upvotes: 6
Views: 6842
Reputation: 3280
You may want to look into the zoo package. It provides the zoo class, which handles irregular time series very well. It also has a zooreg class, which is analogous to the ts class, but for this answer we'll stick with zoo.
Here is a quick solution:
Step 0: Loading the package and data:
# install.packages("zoo")
library(zoo)
st.dates <- seq(as.Date('2012-01-01'), as.Date('2014-01-01'), by = "day")
n <- length(st.dates)
storeData <-
    data.frame(
        store        = rep(1:26, n),
        dollars      = sample(100:100000, n*26, replace = TRUE)/100,
        transactions = sample(0:1000, n*26, replace = TRUE),
        traffic      = sample(0:2000, n*26, replace = TRUE)
    )
Note that I am using a data.frame instead of a data.table, and the construction of the data is slightly different. With your level of expertise, I am sure you can manipulate it easily.
Step 1: Converting the data to time series
Now, in principle you can convert this to the zoo or zooreg class right away:
zoo(storeData, order.by=rep(st.dates,26))
But note that each day will have multiple entries corresponding to different stores, so this creates problems with duplicate timestamps in the time series.
Since in our case, we won't have more than one entry per day per store, we want to keep the info for each store separate. So we split the data by store:
storeDataList <- split(storeData,factor(storeData$store))
Now we are ready to convert the data into time series:
storeDataZooList <- lapply(storeDataList,
                           function(d) zoo(d, order.by = st.dates))
This gives a list of time series data for all the stores.
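If you're starting from long-format data like the table described in the question (one row per store and date) rather than my data.frame, the same split-then-convert idea still applies. A minimal sketch with made-up data (sdt is a name I'm introducing here; the columns mirror the question's):

```r
library(zoo)
library(data.table)

# Made-up long-format data: one row per store and date
set.seed(1)
sdt <- CJ(store = LETTERS[1:3],
          date  = seq(as.Date("2012-01-01"), as.Date("2012-03-31"), by = "day"))
sdt[, dollars      := runif(.N, 1, 1000)]
sdt[, transactions := sample(0:1000, .N, replace = TRUE)]

# Split by store, then index each store's numeric columns by its own dates
storeDataZooList <- lapply(split(sdt, sdt$store), function(d)
  zoo(d[, .(dollars, transactions)], order.by = d$date))

length(storeDataZooList)   # one zoo object per store
head(storeDataZooList[["A"]])
```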
Step 2: Analysis
Now that you have time series data for all the stores, aggregating is straightforward:
You can aggregate a specific field for any store by quarter or month (note that aggregate() averages by default; pass a function such as sum as a third argument if you want totals):
aggregate(storeDataZooList[[1]]$dollars, as.yearqtr)
aggregate(storeDataZooList[[1]]$dollars, as.yearmon)
You can aggregate all the fields for any store by quarter or month:
aggregate(storeDataZooList[[1]], as.yearqtr)
aggregate(storeDataZooList[[1]], as.yearmon)
You can aggregate all the fields for all the stores by quarter or month:
lapply(storeDataZooList, aggregate, as.yearmon)
lapply(storeDataZooList, aggregate, as.yearqtr)
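To tie this back to the first example question ("how many stores had positive dollar sales growth in January 2014 compared to January 2013?"), a rough sketch using toy single-column series (remember aggregate() averages by default, so I pass sum; dollarsZooList and grewInJan are names I'm making up for illustration):

```r
library(zoo)

# Toy data: one zoo series of daily dollar sales per store
set.seed(1)
dates <- seq(as.Date("2013-01-01"), as.Date("2014-01-31"), by = "day")
dollarsZooList <- list(
  A = zoo(runif(length(dates), 1, 1000), order.by = dates),
  B = zoo(runif(length(dates), 1, 1000), order.by = dates)
)

# Monthly totals per store
monthly <- lapply(dollarsZooList, aggregate, as.yearmon, sum)

# TRUE for each store whose Jan 2014 total beats its Jan 2013 total
grewInJan <- sapply(monthly, function(m)
  as.numeric(m[as.yearmon("2014-01")]) > as.numeric(m[as.yearmon("2013-01")]))
sum(grewInJan)  # number of stores with positive January growth
```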
The documentation for the package is fairly detailed, and I am sure it will help you tremendously.
Hope this helps!!
EDIT: Note that for simplicity I have kept the storeID as numeric. If you have an alpha-numeric storeID, you will have to drop the storeID at the time series conversion step for aggregation to work:
storeDataZooList <- lapply(storeDataList,
                           function(d) zoo(d[,-1], order.by = st.dates))
Upvotes: 5
Reputation: 18420
Try the package TimeProjection to extract useful features from the date and aggregate on these derived features. In other words, stick to data.frame or data.table unless you want to run algorithms that need the ts structure.
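The derive-features-then-aggregate idea can be sketched with plain data.table, without TimeProjection itself (made-up data; year() and quarter() are helpers that ship with data.table):

```r
library(data.table)

# Made-up daily data in the question's shape
set.seed(1)
dt <- CJ(store = LETTERS[1:3],
         date  = seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day"))
dt[, dollars := runif(.N, 1, 1000)]

# Derive calendar features from the date column
dt[, `:=`(yr = year(date), qtr = quarter(date))]

# Aggregate on the derived features
agg <- dt[, .(dollars = sum(dollars)), by = .(store, yr, qtr)]
agg
```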
Upvotes: 1
Reputation: 21625
You're asking a lot of questions here. I recommend you spend time reading about all the things data.table can do involving joins and aggregating data. Here is an example of how you would get the year over year growth of each store in the first quarter.
#get the first day of the first month for your binning
#(month<- and day<- below are replacement functions from the lubridate package)
library(lubridate)
minDate<-min(storeData$date); month(minDate)<-1; day(minDate)<-1
#get the first day of the last month for your binning
maxDate<-max(storeData$date); month(maxDate)<-12; day(maxDate)<-1
#Build some bins
yearly<-data.table(leftBound=seq.Date(minDate,maxDate,by="year"))
quarterly<-data.table(leftBound=seq.Date(minDate,maxDate,by="3 months"))
monthly<-data.table(leftBound=seq.Date(minDate,maxDate,by="month"))
#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]
setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")
temp<-quarterly[storeData, roll=TRUE] #associate each (store, date) pair with a quarter
#create a "join table" containing each quarter for each store
jt<-CJ(leftBound=quarterly$leftBound, store=unique(storeData$store))
setkey(temp,"leftBound","store")
dt<-temp[jt, allow.cartesian=TRUE]
dt[, `:=`(year=year(leftBound), quarter=quarter(leftBound))]
qSummary<-dt[,list(dollars=sum(dollars, na.rm=TRUE),
transactions=sum(transactions, na.rm=TRUE),
traffic=sum(traffic, na.rm=TRUE)),
by=list(year,quarter,store)] #Summarize the data by quarter
#Get year/year growth for Q1
qSummary[, list(dollarGrowth = dollars[which(year==2014 & quarter==1)] /
                               dollars[which(year==2013 & quarter==1)]), by=store]
#First five rows...
store dollarGrowth
1: A 0.0134860
2: B 0.0137215
3: C 0.0188249
4: D 0.0163887
5: E 0.0037576
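For the other example question ("what is the overall trend in dollars per transaction for the past 3 months?"), a hedged sketch along the same data.table lines (made-up data; cutoff, daily, and fit are names I'm introducing, and I approximate "3 months" as 90 days):

```r
library(data.table)

# Made-up data in the question's shape
set.seed(1)
storeData <- CJ(store = LETTERS[1:5],
                date  = seq(as.Date("2012-01-01"), as.Date("2014-01-01"), by = "day"))
storeData[, dollars      := runif(.N, 1, 1000)]
storeData[, transactions := sample(1:1000, .N, replace = TRUE)]

# Restrict to roughly the past 3 months, then compute daily dollars/transaction
cutoff <- max(storeData$date) - 90
daily  <- storeData[date >= cutoff,
                    .(dpt = sum(dollars) / sum(transactions)), by = date]

# Fit a linear trend; the slope's sign gives the overall direction
fit <- lm(dpt ~ as.numeric(date), data = daily)
coef(fit)[2]  # estimated change in dollars/transaction per day
```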
Upvotes: 7