Dragos Geornoiu

Reputation: 557

Inaccurate predictions with Poisson Regression in R

I am trying to predict the number of visitors to a website from historical data. I think this is a scenario in which I could use Poisson regression.

The input consists of 6 columns:

id (the id of the website), day, month, year, day of week, visits.

So the input is a CSV whose rows look like this: "2","22","7","2015","6","751".

I am trying to predict the visits from the previous number of visits. The size of the websites can vary, so I divided them into 5 categories.

To do that I added a 7th column named type, which is an int ranging from 1 to 5.

My code is as follows:

train <- read.csv("train.csv", header = TRUE)
model <- glm(visits ~ type + day + month + year + dayofweek, data = train, family = poisson)
summary(model)
P <- predict(model, newdata = train)
imp <- round(P)
imp

The predicted values are not even close. I thought I could end up within 10-20% of the actual values, but most of the predicted values are 200-300% bigger than the actual values. And this is on the training data set, which should give an optimistic view.

I am new to R and having some problems interpreting the data returned by the summary command. This is what it returns:

Call:
glm(formula = visits ~ type + day + month + year + dayofweek, family = poisson, data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-571.05   -44.04   -11.33    -5.14   734.43

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)
(Intercept) -9.998e+02  6.810e-01 -1468.19   <2e-16 ***
type         2.368e+00  1.280e-04 18498.53   <2e-16 ***
day         -2.473e-04  6.273e-06   -39.42   <2e-16 ***
month        1.658e-02  3.474e-05   477.31   <2e-16 ***
year         4.963e-01  3.378e-04  1469.31   <2e-16 ***
dayofweek   -3.783e-02  2.621e-05 -1443.46   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1239161821  on 12370  degrees of freedom
Residual deviance:  157095033  on 12365  degrees of freedom
AIC: 157176273

Number of Fisher Scoring iterations: 5

Could anyone describe in more detail the values returned by the summary command, and what they should look like in a Poisson regression that gives better predictions? Are there better approaches in R for data where the value to be estimated evolves over time?

LE. link to train.csv file.

Upvotes: 3

Views: 2818

Answers (1)

Richard Telford

Reputation: 9923

Your problem is with the predict command. The default in predict.glm is to make predictions on the scale of the link function, which for a Poisson model is the log of the expected count. If you want predictions that you can directly compare with the original data, you need to use the argument type = "response":

P <- predict(model, newdata = train, type = "response")
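To see the difference between the two scales, here is a minimal sketch on simulated data standing in for train.csv. The data frame toy and the names fit, p_link and p_resp are made up for illustration; only the column names visits and dayofweek come from the question.

```r
# Simulated stand-in for train.csv (hypothetical data, not the real file)
set.seed(1)
toy <- data.frame(dayofweek = rep(1:7, times = 20))
toy$visits <- rpois(nrow(toy), lambda = exp(5 - 0.05 * toy$dayofweek))

fit <- glm(visits ~ dayofweek, data = toy, family = poisson)

p_link <- predict(fit, newdata = toy)                     # default: log scale
p_resp <- predict(fit, newdata = toy, type = "response")  # expected counts

# With the log link, exponentiating the link-scale values gives the counts
all.equal(exp(p_link), p_resp)

range(p_link)  # values near log(mean(visits))
range(p_resp)  # values on the same scale as visits
```

So rounding the default output of predict, as in the question, rounds log-counts rather than counts.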

The model setup is not ideal. Perhaps month should be included as a categorical variable (as.factor), and you need to think more about day (day 31 of one month is followed by day 1 of the next, so a linear effect of day makes little sense). The predictor "type" is also dubious, as it is derived directly from the response.
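As a rough illustration of what treating month (and, by the same reasoning, dayofweek) as categorical changes, here is a sketch on simulated data. The data frame toy and the names fit_num and fit_fac are hypothetical; whether factors actually help on your data is something you would have to check.

```r
# Balanced simulated data: every month/weekday combination appears 5 times
set.seed(2)
toy <- expand.grid(month = 1:12, dayofweek = 1:7, rep = 1:5)
toy$visits <- rpois(nrow(toy), lambda = 200)

# Numeric coding: one slope per variable, i.e. a steady trend from 1 upwards
fit_num <- glm(visits ~ month + dayofweek, data = toy, family = poisson)

# Categorical coding: a separate effect for each month and each weekday
fit_fac <- glm(visits ~ factor(month) + factor(dayofweek),
               data = toy, family = poisson)

length(coef(fit_num))  # 3: intercept + 2 slopes
length(coef(fit_fac))  # 18: intercept + 11 month levels + 6 weekday levels
```

The factor version lets, say, Saturday differ from Friday without forcing a monotone trend across the week.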

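Your model is also highly over-dispersed: the residual deviance (157095033) is roughly 12,700 times its degrees of freedom (12365), whereas for a well-fitting Poisson model the two should be of similar size. This might indicate missing predictors or other problems.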
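A quick way to check this, sketched on simulated over-dispersed counts (toy, mu, fit_pois and fit_quasi are made-up names). Refitting with family = quasipoisson is only one possible response, swapped in here for illustration; it keeps the same coefficients but rescales the standard errors.

```r
# Ratio taken from the summary output in the question
157095033 / 12365  # about 12705

# Simulated counts with more variance than a Poisson model allows
set.seed(3)
toy <- data.frame(dayofweek = rep(1:7, times = 50))
mu <- exp(5 - 0.05 * toy$dayofweek)
toy$visits <- rnbinom(nrow(toy), mu = mu, size = 2)

fit_pois  <- glm(visits ~ dayofweek, data = toy, family = poisson)
fit_quasi <- glm(visits ~ dayofweek, data = toy, family = quasipoisson)

# Residual deviance per degree of freedom: far above 1 signals over-dispersion
deviance(fit_pois) / df.residual(fit_pois)

# Dispersion estimated by the quasi-Poisson fit (fixed at 1 for poisson)
summary(fit_quasi)$dispersion

# Identical point estimates, but wider standard errors
cbind(poisson = coef(summary(fit_pois))[, "Std. Error"],
      quasi   = coef(summary(fit_quasi))[, "Std. Error"])
```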

You should also think about using a mixed effect model.
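One way this could look, assuming the lme4 package is installed and assuming a random intercept per website id is the structure you want (that choice is my guess at a sensible starting point, not something tested on your data). The simulated data frame toy and the names site_effect and fit_mixed are hypothetical.

```r
# Sketch only: runs if lme4 is available
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(4)
  # 20 simulated websites, each observed on 7 weekdays for 4 weeks
  toy <- expand.grid(id = factor(1:20), dayofweek = 1:7, week = 1:4)
  site_effect <- rnorm(20, mean = 0, sd = 1)  # per-site size on the log scale
  toy$visits <- rpois(
    nrow(toy),
    lambda = exp(4 + site_effect[as.integer(toy$id)] - 0.05 * toy$dayofweek)
  )

  # Random intercept for each website instead of the hand-made type column
  fit_mixed <- lme4::glmer(visits ~ factor(dayofweek) + (1 | id),
                           data = toy, family = poisson)

  print(lme4::fixef(fit_mixed))    # weekday effects shared by all sites
  print(lme4::VarCorr(fit_mixed))  # between-site variation
}
```

The random intercept lets each site have its own baseline level without deriving a predictor from the response.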

Upvotes: 4
