Batuhan B

Reputation: 1855

Normalize Time Series - Scikit

I have:

  1. Weekly access counts for 3 Wikipedia articles (A, B, C)
  2. Weekly ground truth data
  3. Weekly total traffic counts for English Wikipedia articles

My goal is to build a multiple linear regression on the 3 Wikipedia article access counts and use it to predict future ground truth data.

Before building the regression, I want to preprocess (normalize or scale) the 3 access-count series.

My data format is like this.

    date     | A (x1)     | B (x2)  |  C (x3) | total_en     | ground truth(y)

 01/01/2008  |   5611     |   606   |    376  |  1467923911  | 3.13599886
 08/01/2008  |   8147     |   912   |    569  |  1627405409  | 2.53335614
 15/01/2008  |   9809     |   873   |    597  |  1744099880  | 2.91287713
 22/01/2008  |   12020    |   882   |    600  |  1804646235  | 3.44497102  
 ...         |    ...     |   ...   |    ...  |    ...       | ...

Without normalization, I build my multiple linear regression like this:

    from sklearn import linear_model
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection.
    from sklearn.model_selection import train_test_split

    # wiki3: NumPy array of shape (150, 3) holding the A, B, C article counts
    # ground_truth: NumPy array of shape (150, 1) holding the ground truth data

    X_train, X_test, y_train, y_test = train_test_split(
        wiki3, ground_truth, test_size=0.3, random_state=1)

    model = linear_model.LinearRegression()
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)

My question is: to get better results, how can I normalize or scale my x1, x2, x3 and y data?

Should I normalize each article's count by the total English article traffic, or should I use another approach?

Is K-Fold cross-validation sensible for time series?

Thanks.

Upvotes: 0

Views: 1284

Answers (1)

selwyth

Reputation: 2497

To scale your data, you can use sklearn.preprocessing.scale. If date is your index, it's as simple as wiki3_scaled = scale(wiki3) (if not, then date would also be scaled, which you likely don't want).
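As a minimal sketch of that call, assuming wiki3 holds only the A, B and C columns, here it is applied to the four sample rows from the question (the array literal is just those rows typed in by hand):

```python
import numpy as np
from sklearn.preprocessing import scale

# First four weekly rows of A, B, C copied from the question's table.
wiki3 = np.array([
    [5611, 606, 376],
    [8147, 912, 569],
    [9809, 873, 597],
    [12020, 882, 600],
], dtype=float)

# scale() standardizes each column to zero mean and unit variance.
wiki3_scaled = scale(wiki3)

print(wiki3_scaled.mean(axis=0))  # approximately 0 for every column
print(wiki3_scaled.std(axis=0))   # approximately 1 for every column
```

Note that scale computes the mean and variance over every row it is given. If you split into train and test sets, fitting sklearn.preprocessing.StandardScaler on the training rows only and then transforming the test rows is one way to keep test statistics out of the preprocessing.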

Normalizing with total_en is a modeling decision. If you have reason to believe A / total_en is a better feature than A, then go for it. Better yet, try both.
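A rough sketch of building the A / total_en style features with NumPy, again using the question's four sample rows. The name wiki3_share is made up for illustration:

```python
import numpy as np

# A, B, C counts and total English traffic for the first four weeks
# of the question's table.
wiki3 = np.array([
    [5611, 606, 376],
    [8147, 912, 569],
    [9809, 873, 597],
    [12020, 882, 600],
], dtype=float)
total_en = np.array([1467923911, 1627405409, 1744099880, 1804646235],
                    dtype=float)

# Divide each article's count by that week's total traffic.
# total_en[:, None] has shape (4, 1), so it broadcasts across the 3 columns.
wiki3_share = wiki3 / total_en[:, None]

print(wiki3_share.shape)
```

To "try both", you could fit one model on wiki3 and another on wiki3_share and compare their errors on the same held-out weeks.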

If you're trying to predict ground truth from same-day A, B and C, then it's not really a time-series problem and k-Fold cross-validation is certainly sensible. If you're trying to predict a future ground truth from today's A, B, C, ground truth and maybe the respective lagged variables, then I don't see why you can't cross-validate either; just be careful to set it up such that you train on history and cross-validate against the future.
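One way to get "train on history, validate against the future" splits is scikit-learn's TimeSeriesSplit. This is a sketch and not something the question's code uses. X and y below are synthetic stand-ins for wiki3 and ground_truth, assumed to be sorted by date:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
# Synthetic stand-ins for the question's arrays: 150 weeks in date order.
X = rng.normal(size=(150, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=150)

tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, test_idx in tscv.split(X):
    # Each fold trains on earlier rows and tests on the rows that follow.
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 per fold

print(scores)
```

Unlike a shuffled K-Fold, no fold here is trained on weeks that come after the weeks it is tested on.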

You might get better answers to these modeling questions on Cross Validated, since Stack Overflow is more programming-focused.

Upvotes: 1
