chink

Reputation: 1643

Reduction of model accuracy when using PCA for a regression problem

I am trying to build a prediction model for flight fares. My data set has several categorical variables like class, hour, day of week, day of month, month of year, etc. I am using multiple algorithms like XGBoost and ANNs to fit the model.

Initially I one-hot encoded these variables, which led to a total of 90 variables. When I fitted a model to this data, the training r2_score was high, around 0.90, but the test score was relatively low (0.60).

I then used sine and cosine transformations for the temporal variables, which led to a total of only 27 variables. With this, training accuracy dropped to 0.83 but the test score increased to 0.70.

I was thinking that my variables were sparse and tried PCA, but this drastically reduced performance on both the train and test sets.

So I have a few questions:

  1. Why is PCA not helping, and instead reducing the performance of my model so badly?
  2. Any suggestions on how to improve my model performance?

Code:


from xgboost import XGBRegressor
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_excel('Airline Dataset1.xlsx', sheet_name='Airline Dataset1')

# Drop the serial-number column; it carries no information
dataset = dataset.drop(columns=['SL. No.'])

# Normalise 'time' so the hours wrap around within a single day
dataset['time'] = dataset['time'] - 24
dataset['time'] = np.where(dataset['time'] == 24, 0, dataset['time'])

cat_cols = ['demand', 'from_ind', 'to_ind']          # plain categorical columns

cyc_cols = ['time', 'weekday', 'month', 'monthday']  # cyclic temporal columns

def cyclic_encode(data, col, col_max):
    # Project a cyclic feature onto the unit circle so its endpoints meet
    data[col + '_sin'] = np.sin(2*np.pi*data[col]/col_max)
    data[col + '_cos'] = np.cos(2*np.pi*data[col]/col_max)
    return data

cyclic_encode(dataset, 'time', 23)
cyclic_encode(dataset, 'weekday', 6)
cyclic_encode(dataset, 'month', 11)
cyclic_encode(dataset, 'monthday', 31)

dataset = dataset.drop(columns=cyc_cols)


ohe_dataset = pd.get_dummies(dataset, columns=cat_cols, drop_first=True)
X = ohe_dataset.iloc[:, :-1]   # all features
y = ohe_dataset.iloc[:, -1:]   # the target (fare) is the last column

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size=0.2, random_state=0)


# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)

y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)


# Applying PCA (note: PCA is unsupervised and ignores the target)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
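
# Illustrative check (not in the original post): the cumulative explained
# variance shows how much signal only 2 of the 27 components retain
print(np.cumsum(explained_variance))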

regressor = XGBRegressor()
regressor.fit(X_train, y_train.ravel())  # ravel to pass y as a 1-D array

# Predicting the test & train sets with the fitted regressor;
# StandardScaler.inverse_transform expects 2-D input, so reshape the 1-D predictions
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred.reshape(-1, 1))
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train.reshape(-1, 1))
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)


# Calculate r2_score on the original fare scale
from sklearn.metrics import r2_score
score_train = r2_score(y_train, y_pred_train)
score_test = r2_score(y_test, y_pred)

Thanks

Upvotes: 0

Views: 722

Answers (1)

Priya

Reputation: 315

You don't really need PCA for such a low-dimensional problem. Decision trees perform very well even with thousands of variables.

Here are a few things you can try:

  1. Pass a watchlist and train only until you stop overfitting on the validation set (a minimal sketch follows this list): https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
  2. Try the sine/cosine transformations and the one-hot encodings together in a single model (along with the watchlist).
  3. Look for more causal data. Seasonal patterns alone do not cause air-fare fluctuations. To start, you can add flags for festivals, holidays, and important dates, and engineer features for proximity to these days. Weather data is also easy to find and add.
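
A minimal sketch of point 1, using the scikit-learn wrapper and the X_train/y_train arrays from your question (note: in recent xgboost versions the early_stopping_rounds argument has moved from fit to the XGBRegressor constructor):

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Carve a validation set out of the training data to act as the watchlist
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# Generous cap on boosting rounds; early stopping picks the effective number
regressor = XGBRegressor(n_estimators=1000)
regressor.fit(X_tr, y_tr.ravel(),
              eval_set=[(X_tr, y_tr.ravel()), (X_val, y_val.ravel())],
              early_stopping_rounds=50,  # stop once validation error stalls for 50 rounds
              verbose=False)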

PCA usually helps in cases of extreme dimensionality, like genome data, or when the algorithm involved does not do well with high-dimensional data, like kNN.
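
On question 1 specifically: n_components=2 keeps only two directions out of your 27 variables, so most of the variance is discarded before the regressor ever sees it. If you do want PCA, here is a minimal sketch of letting the explained variance pick the component count (the 0.95 threshold is an arbitrary illustrative choice):

import numpy as np
from sklearn.decomposition import PCA

# Fit with all components first and inspect how variance accumulates
pca_full = PCA().fit(X_train)
print(np.cumsum(pca_full.explained_variance_ratio_))

# sklearn also accepts a variance fraction directly: keep enough
# components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)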

Upvotes: 1
