Reputation: 1855
I have:
My goal is to build a multiple linear regression on the access counts of 3 Wikipedia articles and use it to predict future ground truth data.
Before building the regression, I want to do some preprocessing (normalization or scaling) on the 3 Wikipedia access count series.
My data looks like this:
date | A (x1) | B (x2) | C (x3) | total_en | ground truth(y)
01/01/2008 | 5611 | 606 | 376 | 1467923911 | 3.13599886
08/01/2008 | 8147 | 912 | 569 | 1627405409 | 2.53335614
15/01/2008 | 9809 | 873 | 597 | 1744099880 | 2.91287713
22/01/2008 | 12020 | 882 | 600 | 1804646235 | 3.44497102
... | ... | ... | ... | ... | ...
Without normalization, I build my multiple linear regression like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # lived in sklearn.cross_validation in older releases
# wiki3: numpy array of shape (150, 3) holding the A, B, C article counts
# ground_truth: numpy array of shape (150, 1) holding the ground truth values
X_train, X_test, y_train, y_test = train_test_split(wiki3, ground_truth, test_size=0.3, random_state=1)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
My question is: how can I normalize/scale my x1, x2, x3 and y data to get better results?
Should I normalize each article by the total English article traffic (total_en), or should I use another approach?
Is K-Fold cross-validation sensible for time series?
Thanks.
Upvotes: 0
Views: 1284
Reputation: 2497
To scale your data, you can use sklearn.preprocessing.scale. If date is your index, it's as simple as wiki3_scaled = scale(wiki3) (if not, then date would also be scaled, which you likely don't want).
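As a minimal sketch of the scaling step, the snippet below uses only the four sample rows shown in the question (the date column is left out). The second half is an assumption on my part rather than something the question asked for: if you keep a train/test split, fitting a StandardScaler on the training rows and reusing it on the test rows keeps test statistics out of the scaling. The 3/1 split here is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# The four sample rows from the question (columns A, B, C); date is excluded.
wiki3 = np.array([
    [5611, 606, 376],
    [8147, 912, 569],
    [9809, 873, 597],
    [12020, 882, 600],
], dtype=float)

# One-shot standardisation: each column ends up with zero mean and unit variance.
wiki3_scaled = scale(wiki3)

# Illustrative split (assumption): first three rows for training, last row for testing.
X_train, X_test = wiki3[:3], wiki3[3:]

# Learn the column means and standard deviations on the training rows only,
# then apply the same transformation to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Both routes apply the same per-column standardisation; the difference is only which rows the mean and standard deviation are computed from.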
Normalizing with total_en is a modeling decision. If you have reason to believe A / total_en is a better feature than A, then go for it. Better yet, try both.
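To make "try both" concrete, here is a small sketch that builds the two candidate feature matrices from the four sample rows in the question. The names raw_features and share_features are made up for this illustration.

```python
import numpy as np

# Sample rows from the question.
A = np.array([5611, 8147, 9809, 12020], dtype=float)
B = np.array([606, 912, 873, 882], dtype=float)
C = np.array([376, 569, 597, 600], dtype=float)
total_en = np.array([1467923911, 1627405409, 1744099880, 1804646235], dtype=float)

# Candidate 1: the raw access counts.
raw_features = np.column_stack([A, B, C])

# Candidate 2: each count divided by that week's total English traffic.
# total_en[:, None] has shape (4, 1), so it broadcasts across the three columns.
share_features = raw_features / total_en[:, None]
```

You can then run the same fit-and-evaluate code once per matrix and keep whichever validates better on held-out data.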
If you're trying to predict ground truth from same-day A, B and C, then it's not really a time-series problem and k-fold cross-validation is certainly sensible. If you're trying to predict a future ground truth from today's A, B, C, ground truth and maybe the respective lagged variables, then I don't see why you can't cross-validate either; just be careful to set it up such that you train on history and validate against the future.
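One way to get the "train on history, validate against the future" setup is scikit-learn's TimeSeriesSplit; this is my suggestion for a sketch, not the only option. The arrays below are random placeholders standing in for your real 150-row series (assumed sorted oldest first), and the one-step-ahead target is an assumption about what "future" means in your case.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data only: replace with the real arrays, sorted oldest first.
rng = np.random.default_rng(0)
wiki3 = rng.random((150, 3))      # columns A, B, C
ground_truth = rng.random(150)

# Features: this row's A, B, C and ground truth. Target: the next row's ground truth.
X = np.column_stack([wiki3[:-1], ground_truth[:-1]])
y = ground_truth[1:]

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X))

scores = []
for train_idx, test_idx in folds:
    # In every fold, all training rows come before all test rows in time.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the later block
```

For the same-day case you could swap TimeSeriesSplit for KFold and drop the one-row shift.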
You might get better answers on these modeling decisions on Cross Validated, since Stack Overflow is more programming-focused.
Upvotes: 1