Reputation:
I've been at this for over a week now, working with a YouTube likes prediction dataset. I dropped all non-textual features and features uncorrelated with the target, leaving 3 features; the dataset has shape (26061, 12). Using linear regression, the MSE was huge and so was the MAE (about 15,000). I also tried gradient boosting with the same result, and discovered it fails on this dataset for any n_estimators value greater than 5. I also tried transforming X_train and X_test with a power transformer to get a roughly Gaussian distribution, but that didn't work either. I can't figure out what is really wrong. Here's a link to my Colab notebook https://colab.research.google.com/drive/1dJZuG0n63842DEwHMR7TzLBmssnOKsj4?usp=sharing and a link to the dataset https://www.kaggle.com/jinxzed/youtube-likes-prediction-av-hacklive
Upvotes: 0
Views: 140
Reputation: 108
The scales of your features are very different ('views' is numerically much larger than the other variables). This causes the 'views' feature to have a greater influence on the final output than the other variables.
I'd recommend standardizing the features before feeding the data into any model. You can use sklearn's StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training data only, then apply the same transformation
# to the test data, so no test-set statistics leak into training.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
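For intuition, standardization just centers each feature and divides by its standard deviation, so features on wildly different scales end up comparable. A minimal sketch with plain Python and hypothetical numbers (no sklearn needed, and these values are made up for illustration):

```python
from statistics import mean, pstdev

# Hypothetical raw values for two features on very different scales:
views = [120000, 540000, 80000, 910000]  # large numbers
ratio = [0.12, 0.54, 0.08, 0.91]         # small numbers

def standardize(xs):
    """Center to mean 0 and scale to (population) std 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

views_z = standardize(views)
ratio_z = standardize(ratio)

# After scaling, both features have mean ~0 and std ~1, so neither
# dominates a gradient- or distance-based model by sheer magnitude.
print([round(v, 2) for v in views_z])
print([round(r, 2) for r in ratio_z])
```

StandardScaler does the same thing column-by-column, while also remembering the training-set mean and std so the identical transform can be reapplied to the test set.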
Also, a large MSE doesn't necessarily mean your model is bad, since the magnitude of the MSE depends on the magnitude of the label y. For example, the same 10% difference between true_label and predict_label produces very different squared errors at different scales:

true_label = 1000, predict_label = 1100 -> Squared Error = 10000
true_label = 1, predict_label = 1.1 -> Squared Error = 0.01
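The two cases above can be checked directly in plain Python:

```python
# Same 10% relative error, two very different label scales.
true_a, pred_a = 1000, 1100
true_b, pred_b = 1, 1.1

se_a = (pred_a - true_a) ** 2  # 100 ** 2
se_b = (pred_b - true_b) ** 2  # ~0.1 ** 2 (small float rounding)

print(se_a)            # 10000
print(round(se_b, 2))  # 0.01
```

This is why a scale-free metric such as MAPE, or evaluating on a log-transformed target, can give a fairer picture when y spans a large range (as like counts do).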
Upvotes: 1