I am trying to run Spark MLlib in PySpark on a test machine learning data set. I split the data in half: one half for training, the other for testing. Below is the code that builds the model. However, the resulting weights are NaN for every feature (independent variable), and I couldn't figure out why. It does work when I standardize the data with StandardScaler first.
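from pyspark.mllib.regression import LinearRegressionWithSGD
# train_data and test_data are RDDs of LabeledPoint
model = LinearRegressionWithSGD.train(train_data, step=0.01)
# evaluate model on test data set: pair each true label with its prediction
valuesAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)))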
Thank you very much for the help.
Below is the code that I used to do the scaling.
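from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint
# fit the scaler on the feature vectors, then transform them one by one on the driver
scaler = StandardScaler(withMean=True, withStd=True).fit(data.map(lambda x: x.features))
feature = [scaler.transform(x) for x in data.map(lambda x: x.features).collect()]
label = data.map(lambda x: x.label).collect()
# scaledData is a local Python list of LabeledPoint, not an RDD
scaledData = [LabeledPoint(l, f) for l, f in zip(label, feature)]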
Upvotes: 3
Reputation: 2334
Try scaling the features
StandardScaler standardizes features by scaling to unit variance and/or removing the mean using column summary statistics on the samples in the training set. This is a very common pre-processing step.
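To make the transformation concrete, here is a minimal pure-Python sketch of column-wise standardization, not the MLlib implementation itself. The helper name `standardize_columns` and the sample rows are made up for illustration, and it assumes the sample (n - 1) standard deviation is the one being used.

```python
from statistics import mean, stdev

def standardize_columns(rows):
    """Shift each column to mean 0 and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]  # sample std (n - 1), an assumption here
    return [
        # constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# hypothetical data: column 0 is revenue-like, column 1 is a small count
rows = [[120000.0, 3.0], [450000.0, 7.0], [300000.0, 5.0], [810000.0, 12.0]]
scaled = standardize_columns(rows)
```

After this step both columns live on the same scale, regardless of the units they started in.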
Standardization can improve the convergence rate during optimization, and it also prevents features with very large variances from exerting an overly large influence during model training. With unscaled features that large, gradient descent at a fixed step size can overshoot on every update, so the weights grow until they overflow, which would explain the NaN values. Since you have some variables that are large numbers (e.g. revenue) and some that are smaller (e.g. number of clients), this should solve your problem.
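The effect is easy to reproduce outside Spark. The sketch below is a simplification, not what LinearRegressionWithSGD does internally: it runs plain batch gradient descent with a fixed step and no intercept, where MLlib uses mini-batch SGD with its own step-size schedule. The data, labels, and function names are invented for the illustration.

```python
import math
from statistics import mean, stdev

def train_linear_gd(rows, labels, step=0.01, iterations=100):
    """Batch gradient descent on mean squared error, no intercept term."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(iterations):
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj / len(rows)
        weights = [w - step * g for w, g in zip(weights, grad)]
    return weights

def standardize_columns(rows):
    """Shift each column to mean 0 and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), stdev(col)) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

# hypothetical data: a revenue-like column next to a small count column
rows = [[120000.0, 3.0], [450000.0, 7.0], [300000.0, 5.0], [810000.0, 12.0]]
labels = [10.0, 25.0, 18.0, 40.0]

unscaled_weights = train_linear_gd(rows, labels)
scaled_weights = train_linear_gd(standardize_columns(rows), labels)

print("unscaled:", unscaled_weights)
print("scaled:  ", scaled_weights)
```

On this toy data the unscaled run blows up: each update multiplies the weights by a huge factor until they overflow to infinity and then turn into NaN, while the standardized run stays finite with the same step of 0.01.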
Upvotes: 0