tgmjack

Reputation: 42

How to confirm a correlation between features?

I have data showing the price to lease different cars. I created a matrix of the correlations between the features involved, but I do not trust it. In my experience, the correlations it shows should not be there. The blp (the cost to purchase the car outright) should be the most important factor, yet seats and engine volume come out on top. (Engine volume I can understand, but seats?) Perhaps the problem is how I scaled my data.

correlation matrix image

from matplotlib import pyplot
import pandas as pd
import numpy as np

from sklearn import preprocessing

def scale_this_data(data, col_names):

    print("scaling data now")
    new_df = pd.DataFrame(columns = col_names)
    for col in data.columns:
        if col in col_names:
            np_arr = data[col].values
            np_arr = np_arr.reshape(-1, 1)
            min_max_scaler = preprocessing.MinMaxScaler()
            
            np_arr = min_max_scaler.fit_transform(np_arr)
            
            # keep one original value so the before/after print is meaningful
            old = data[col].iloc[3]
            data[col] = np_arr
            print(str(old) + "   this   became    this    = " + str(data[col].iloc[3]))
    return data
    
Path = "new_ratebook.csv"
col_names = ['Net Rental2','Doors2', 'Seats2', 'BHP2', 'Eng CC2', 'CO22',  'blp2']
data = pd.read_csv(Path , dtype = str , index_col=False, low_memory=False)


data = scale_this_data(data, col_names)
data.to_csv("scaleddata.csv")

# select the columns explicitly so the matrix order matches the tick labels below
correlations = data[col_names].corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
# Pearson correlations range from -1 to 1; vmin=0 would hide negative values
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,7,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(col_names)
ax.set_yticklabels(col_names)
pyplot.savefig('correlations.png')
pyplot.show()

Question: how do I confirm to myself that the correlation is correct?

Upvotes: 0

Views: 280

Answers (1)

Konstantinos

Reputation: 4406

You can confirm it in several ways. Some of them are:

  1. Verify that the data are correct.
  2. Take a small subset of your data (reduce the length of your data frame).
  3. Calculate the correlation for that subset by hand (a good way to convince yourself), or with a calculator, Excel, or an online tool such as a Pearson correlation coefficient calculator.
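
The three steps above could look roughly like the sketch below. The rows in `sample` are made-up stand-ins, not values from the real ratebook, and `pearson_by_hand` is a hypothetical helper that just spells out the textbook Pearson formula. The idea is only to check that pandas and a hand calculation agree on a small subset.

```python
import math
import pandas as pd

def pearson_by_hand(xs, ys):
    """Pearson r from the textbook formula, using plain Python only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical small subset, stored as strings like a CSV read with dtype=str.
sample = pd.DataFrame({
    "blp2": ["20000", "35000", "50000", "27000", "41000"],
    "Seats2": ["5", "5", "7", "4", "5"],
    "Net Rental2": ["250", "410", "600", "330", "480"],
})

# Step 1: verify the data. Convert to numbers and fail loudly on bad values.
sample = sample.apply(pd.to_numeric, errors="coerce")
assert not sample.isna().any().any(), "some values failed to parse as numbers"

# Steps 2 and 3: on the small subset, compare the hand calculation with pandas.
by_hand = pearson_by_hand(sample["blp2"].tolist(), sample["Net Rental2"].tolist())
by_pandas = sample["blp2"].corr(sample["Net Rental2"])
print("by hand:", by_hand)
print("pandas :", by_pandas)
```

If the two numbers match on a subset you have checked by eye, the correlation code itself is probably fine and the surprise is more likely in the data.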

By the way, correlation does not imply causation.
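
On the scaling concern raised in the question: min-max scaling is a linear rescaling with a positive factor, so it should not change Pearson correlations at all. Here is a minimal sketch on synthetic data. The column names are borrowed from the question, the numbers are randomly generated, and the scaling is written out with pandas using the same formula that `MinMaxScaler` applies with its default range.

```python
import numpy as np
import pandas as pd

# Synthetic data, only for illustration (not the real ratebook).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "blp2": rng.uniform(15000, 60000, size=50),
    "Eng CC2": rng.uniform(1000, 3000, size=50),
})
raw["Net Rental2"] = 0.01 * raw["blp2"] + rng.normal(0, 30, size=50)

# Min-max scaling to [0, 1]: (x - min) / (max - min), column by column.
scaled = (raw - raw.min()) / (raw.max() - raw.min())

# The two correlation matrices should agree up to floating-point error.
same = np.allclose(raw.corr().to_numpy(), scaled.corr().to_numpy())
print("correlations unchanged by scaling:", same)
```

So if the matrix looks wrong, the scaling step is unlikely to be the cause. The input values and the column order are better places to look.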

Upvotes: 0
