Reputation: 42
I have data showing the price to lease different cars. I have created a matrix to show the correlations between the variables involved, but I do not trust it. In my experience, the correlations it shows should not look like this. The blp (the cost to purchase the car outright) should be the most important factor, but instead I am getting seats and engine volume. (Engine volume I can understand, but seats?) Perhaps the problem is how I scaled my data.
from matplotlib import pyplot
import pandas as pd
import numpy as np
from sklearn import preprocessing


def scale_this_data(data, col_names):
    """Min-max scale the columns of `data` that are listed in `col_names`."""
    print("scaling data now")
    for col in data.columns:
        if col not in col_names:
            continue
        # MinMaxScaler expects a 2-D array: one column, one row per sample
        np_arr = data[col].values.reshape(-1, 1)
        min_max_scaler = preprocessing.MinMaxScaler()
        np_arr = min_max_scaler.fit_transform(np_arr)
        old = data[col].iloc[3]
        data[col] = np_arr.ravel()
        # show one sample value before and after scaling
        print(str(old) + " this became this = " + str(data[col].iloc[3]))
    return data


Path = "new_ratebook.csv"
col_names = ['Net Rental2', 'Doors2', 'Seats2', 'BHP2', 'Eng CC2', 'CO22', 'blp2']
data = pd.read_csv(Path, dtype=str, index_col=False, low_memory=False)
data = scale_this_data(data, col_names)
data.to_csv("scaleddata.csv")

# correlation matrix, plotted as a heat map
correlations = data.corr()
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=0, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0, 7, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(col_names)
ax.set_yticklabels(col_names)
pyplot.savefig('correlations.png')
pyplot.show()
Question: how can I confirm for myself that the correlation is correct?
Upvotes: 0
Views: 280
Reputation: 4406
You can confirm it in various ways. Some are the following:
By the way, correlation does not imply causation.
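As one possible sanity check, here is a minimal sketch on made-up data (the numbers and the three-column frame are assumptions for illustration, not the asker's rate book). It relies on the fact that Pearson correlation is unchanged by min-max scaling, so the scaled and unscaled matrices should agree, and it cross-checks one coefficient against `numpy.corrcoef`. It also selects the columns in an explicit order before calling `corr()`. In the question's code the matrix follows the CSV's column order while the tick labels follow `col_names`, so if those two orders differ the labels could end up on the wrong rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rate book; the values are made up for illustration.
rng = np.random.default_rng(0)
blp = rng.uniform(10_000, 60_000, size=200)
df = pd.DataFrame({
    "blp2": blp,
    "Seats2": rng.integers(2, 8, size=200).astype(float),
    "Net Rental2": 0.02 * blp + rng.normal(0, 50, size=200),
})

# Explicit column order, meant to be reused for the tick labels of the plot.
cols = ["Net Rental2", "Seats2", "blp2"]

# Min-max scale each column by hand (maps every column onto [0, 1]).
scaled = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

raw_corr = df[cols].corr()
scaled_corr = scaled.corr()

# Pearson correlation is invariant under min-max scaling, so these should agree.
print(np.allclose(raw_corr, scaled_corr))

# Cross-check one coefficient against NumPy's own implementation.
r = np.corrcoef(df["Net Rental2"], df["blp2"])[0, 1]
print(np.isclose(r, raw_corr.loc["Net Rental2", "blp2"]))

# The matrix rows and columns follow `cols`, so labels taken from `cols` line up.
print(list(raw_corr.columns) == cols)
```

If the scaled and unscaled matrices match and the labels line up, the scaling step is unlikely to be the cause of the surprising values. Plotting with `vmin=-1` instead of `vmin=0` may also help, since negative correlations are otherwise clipped to the same colour as zero.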
Upvotes: 0