sPaul

Reputation: 459

Prediction Algorithms: Time Series

Let's say that for the past few months we have been selling 1000 different products. We log the "performance" of each product (i.e. the cumulative revenue it has generated so far that day) every 5 minutes. A day has 288 segments of 5 minutes. Our log looks like this:

prod_1 | 2013-03-28 | 1 | 0
prod_1 | 2013-03-28 | 2 | 9.90
prod_1 | 2013-03-28 | 3 | 19.80
prod_1 | 2013-03-28 | 4 | 19.80
...
prod_1 | 2013-03-28 | 287 | 2326.5
prod_1 | 2013-03-28 | 288 | 2326.5

So, on 28th March we sold 235 units of prod_1 and we can draw the curve of the product's progress throughout the day. Each product/date pair is our unique object, i.e. we do not connect different days of selling the same product. We have the same data for all of the products.

Let's say on 2013-03-29 we add a new product, prod_1001. The last line in our log for this product reads:

prod_1001 | 2013-03-29 | 153 | 804.6

Question: what machine learning algorithm should we use to predict the revenue that this specific product will have generated by the end of the day?

prod_1001 | 2013-03-29 | 288 | ???

Upvotes: 0

Views: 1662

Answers (2)

Julian Ortega

Reputation: 947

Without being an expert, my feeling is that this is a time series problem, and as far as I know, Mahout doesn't have anything specific for doing time series (I mention this because you tagged the question as Mahout).

These links from mailing lists should shed some light on the matter: link1, link2. They are from 2011, but I think the information still holds true.

The basic gist is that Mahout doesn't have it, but you could implement such a thing and contribute it to the project, or use statistical software better suited to the task, like R (link).

Upvotes: 0

Ben Allison

Reputation: 7394

This isn't an algorithm, but I'd make the following suggestions about the kind of model you might use:

  • One possible model is that each time slice has an independent number of sales in it. It's probably appropriate to model this as Poisson-distributed. The money generated in this period is units * sale price.
  • In such a model, all observations for prod_1001 provide the likelihood function for the Poisson parameter. The maximum likelihood estimator is the mean number of unit sales in all observed time slices. Given this estimate, you have a predictive distribution over the number of units you will sell in some new time slice.
  • To make the prediction for the rest of the day, multiply the Poisson parameter by the number of time slices left in the day. Because a sum of independent Poisson variables is itself Poisson, this gives you a distribution over the number of units you'll sell in the rest of the day. The expectation of this distribution is that scaled parameter, but you might be interested in other quantities.
  • Multiply this by the unit price to get the money you'll make in the rest of the day.

So: if you've seen a mean of 4 units sold per timeslice for prod_1001 so far today, your distribution over how many you'll sell in the next timeslice is Poisson(4). If the product sells for £4.99, your expected revenue in the next timeslice is £19.96, you have less than a 5% chance of making more than 8*£4.99 = £39.92, and so on. If there are 50 timeslices left today, then you expect to make 50*4*£4.99 = £998 more today.
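A minimal Python sketch of the calculation above, using only the standard library. The function names and the input format (a list of unit counts, one per observed timeslice) are my own assumptions for illustration, not something taken from the question's log format:

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_tail(k, lam):
    """P(X > k) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k + 1))


def project_rest_of_day(units_per_slice, unit_price, slices_per_day=288):
    """Return (rate, expected units, expected revenue) for the remaining slices.

    units_per_slice: hypothetical input, units sold in each slice seen so far.
    """
    # Maximum likelihood estimate of the Poisson rate: the sample mean.
    rate = sum(units_per_slice) / len(units_per_slice)
    remaining = slices_per_day - len(units_per_slice)
    # Units in the remaining slices are Poisson(rate * remaining).
    expected_units = rate * remaining
    return rate, expected_units, expected_units * unit_price


# Made-up data matching the worked example: mean of 4 units, 50 slices left.
observed = [4] * 238
rate, units, revenue = project_rest_of_day(observed, unit_price=4.99)
print(rate, units, revenue)
print(poisson_tail(8, rate))  # chance of selling more than 8 in one slice
```

Note that the question's log stores cumulative revenue, so you would first need to difference consecutive rows and divide by the unit price to get per-slice unit counts.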

You might ask how to incorporate the knowledge gleaned from the other products: my instinct as to the easiest way to do this is to use them to estimate an Empirical Bayes prior on the Poisson parameter. This means estimating the two parameters of a Gamma distribution on the Poisson rate, and a simple criterion for that would be to maximise the likelihood of the observations on the other 1000 products. Given this prior, you do Bayesian inference on the Poisson rate for product 1001, which is pretty straightforward because the Gamma prior is conjugate to the Poisson and the posterior predictive distribution has closed form.
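Here is a rough sketch of that Empirical Bayes step. To keep it short I fit the Gamma prior by the method of moments on per-product mean rates rather than by maximising the likelihood as suggested above; that is a cruder substitute, since it ignores the sampling noise in each product's mean. All function names and the example numbers are hypothetical:

```python
from statistics import mean, variance


def fit_gamma_prior(rates):
    """Method-of-moments Gamma(shape, rate) fit to per-product mean rates.

    A simplification: the text above suggests maximum likelihood instead.
    """
    m = mean(rates)
    v = variance(rates)  # sample variance, needs at least two rates
    if v <= 0:
        raise ValueError("rates must vary to fit a Gamma prior")
    return m * m / v, m / v  # shape alpha, rate beta


def posterior(alpha, beta, units_per_slice):
    """Conjugate update: Gamma(alpha + total units, beta + number of slices)."""
    return alpha + sum(units_per_slice), beta + len(units_per_slice)


def expected_remaining_units(alpha_post, beta_post, remaining_slices):
    """Posterior mean rate multiplied by the number of slices left."""
    return alpha_post / beta_post * remaining_slices


# Made-up mean units per slice for three other products.
alpha, beta = fit_gamma_prior([2.0, 4.0, 6.0])
# Made-up observations for the new product: three slices seen so far.
a_post, b_post = posterior(alpha, beta, [5, 5, 5])
print(expected_remaining_units(a_post, b_post, remaining_slices=10))
```

With few observations for the new product the estimate stays close to the prior mean taken from the other products; as more timeslices come in, the product's own data dominates.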

Upvotes: 2
