chink

Reputation: 1643

Reduction of model accuracy when using PCA for a regression problem

I am trying to build a prediction model for flight fares. My data set has several categorical variables like class, hour, day of week, day of month, month of year, etc. I am using multiple algorithms like XGBoost and ANNs to fit the model.

Initially I one-hot encoded these variables, which led to a total of 90 variables. When I fitted a model to this data, the training r2_score was high, around 0.90, but the test score was relatively low (0.60).

I then used sine and cosine transformations for the temporal variables, which led to a total of only 27 variables. With this, training accuracy dropped to 0.83 but the test score increased to 0.70.

I was thinking that my variables were sparse and tried PCA, but this drastically reduced performance on both the train and test sets.

So I have a few questions:

  1. Why is PCA not helping, and instead reducing the performance of my model so badly?
  2. Any suggestions on how to improve my model performance?

Code:


from xgboost import XGBRegressor
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_excel('Airline Dataset1.xlsx', sheet_name='Airline Dataset1')

# Drop the serial-number column; it carries no information
dataset = dataset.drop(columns=['SL. No.'])

# Normalise 'time' so the hours wrap around within a single day
dataset['time'] = dataset['time'] - 24
dataset['time'] = np.where(dataset['time'] == 24, 0, dataset['time'])

cat_cols = ['demand', 'from_ind', 'to_ind']          # plain categorical columns

cyc_cols = ['time', 'weekday', 'month', 'monthday']  # cyclic temporal columns

def cyclic_encode(data, col, col_max):
    # Project a cyclic feature onto the unit circle so its endpoints meet
    data[col + '_sin'] = np.sin(2*np.pi*data[col]/col_max)
    data[col + '_cos'] = np.cos(2*np.pi*data[col]/col_max)
    return data

cyclic_encode(dataset, 'time', 23)
cyclic_encode(dataset, 'weekday', 6)
cyclic_encode(dataset, 'month', 11)
cyclic_encode(dataset, 'monthday', 31)

dataset = dataset.drop(columns=cyc_cols)


ohe_dataset = pd.get_dummies(dataset, columns=cat_cols, drop_first=True)
X = ohe_dataset.iloc[:, :-1]   # all features
y = ohe_dataset.iloc[:, -1:]   # the target (fare) is the last column

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size=0.2, random_state=0)


# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)

y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)


# Applying PCA (note: PCA is unsupervised and ignores the target)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
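
# Illustrative check (not in the original post): the cumulative explained
# variance shows how much signal only 2 of the 27 components retain
print(np.cumsum(explained_variance))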

regressor = XGBRegressor()
regressor.fit(X_train, y_train.ravel())  # ravel to pass y as a 1-D array

# Predicting the test & train sets with the fitted regressor;
# StandardScaler.inverse_transform expects 2-D input, so reshape the 1-D predictions
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred.reshape(-1, 1))
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train.reshape(-1, 1))
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)


# Calculate r2_score on the original fare scale
from sklearn.metrics import r2_score
score_train = r2_score(y_train, y_pred_train)
score_test = r2_score(y_test, y_pred)

Thanks

Upvotes: 0

Views: 722

Answers (1)

Priya

Reputation: 315

You don't really need PCA for such a low-dimensional problem. Decision trees perform very well even with thousands of variables.

Here are a few things you can try:

  1. Pass a watchlist and train only until you stop overfitting on the validation set (a minimal sketch follows this list): https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
  2. Try the sine/cosine transformations and the one-hot encodings together in a single model (along with the watchlist).
  3. Look for more causal data. Seasonal patterns alone do not cause air-fare fluctuations. To start, you can add flags for festivals, holidays, and important dates, and engineer features for proximity to these days. Weather data is also easy to find and add.
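
A minimal sketch of point 1, using the scikit-learn wrapper and the X_train/y_train arrays from your question (note: in recent xgboost versions the early_stopping_rounds argument has moved from fit to the XGBRegressor constructor):

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Carve a validation set out of the training data to act as the watchlist
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# Generous cap on boosting rounds; early stopping picks the effective number
regressor = XGBRegressor(n_estimators=1000)
regressor.fit(X_tr, y_tr.ravel(),
              eval_set=[(X_tr, y_tr.ravel()), (X_val, y_val.ravel())],
              early_stopping_rounds=50,  # stop once validation error stalls for 50 rounds
              verbose=False)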

PCA usually helps in cases of extreme dimensionality, like genome data, or when the algorithm involved does not do well with high-dimensional data, like kNN.
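
On question 1 specifically: n_components=2 keeps only two directions out of your 27 variables, so most of the variance is discarded before the regressor ever sees it. If you do want PCA, here is a minimal sketch of letting the explained variance pick the component count (the 0.95 threshold is an arbitrary illustrative choice):

import numpy as np
from sklearn.decomposition import PCA

# Fit with all components first and inspect how variance accumulates
pca_full = PCA().fit(X_train)
print(np.cumsum(pca_full.explained_variance_ratio_))

# sklearn also accepts a variance fraction directly: keep enough
# components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)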

Upvotes: 1
