Reputation: 19
Hi, I have written a loop that checks the correlation between pairs of variables. Does anyone know how I can collect the results in a new data frame?
In [1]:
from scipy.stats import pearsonr

for colY in Y.columns:
    for colX in X.columns:
        # pearsonr returns (correlation coefficient, p-value); the p-value is discarded here
        corr, _ = pearsonr(numerical_cols_target[colX], numerical_cols_target[colY])
        alpha = 0.05
        print('Pearson Correlation', (alpha, corr))
        if corr <= alpha:
            print(colX + ' and ' + colY + ' two variables are not correlated ')
        else:
            print(colX + ' and ' + colY + ' two variables are highly correlated ')
        print('\n')
    print('\n')
Here's a sample output from the correlation function:
Out [1]:
Pearson Correlation (0.05, -0.1620045985125294)
banana and orange are not correlated
Pearson Correlation (0.05, 0.2267582070839226)
apple and orange are highly correlated
Upvotes: 1
Views: 1430
Reputation: 12417
I think you are looking for this. It computes column-wise correlations for pairs of columns between the X and Y dataframes and builds another dataframe that holds each correlation along with whether it passes the threshold alpha. It assumes Y has no more columns than X; if it does, simply swap X and Y:
import collections
import pandas as pd

d = collections.deque(X.columns)
Y_cols = Y.columns
alpha = 0.05
frames = []
for i in range(len(d)):
    # Rotate the column order of X before each comparison
    d.rotate(i)
    X = X[list(d)]
    corr = Y.corrwith(X, axis=0)
    frames.append(pd.DataFrame({'col_X': list(d)[:len(Y_cols)], 'col_Y': Y.columns, 'corr': corr[:len(Y_cols)], 'is_correlated': corr[:len(Y_cols)] > alpha}))
# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated once at the end
corr_df = pd.concat(frames)
print(corr_df.reset_index())
Sample input and output:
X:
   A  B   C
0  2  2  10
1  4  0   2
2  8  0   1
3  0  0   8
Y:
   B   C
0  2  10
1  0   2
2  0   1
3  0   8
correlation(X, Y):
  col_X col_Y  corr  is_correlated
0     A     B   1.0           True
1     B     C   1.0           True
2     C     B   1.0           True
3     A     C   1.0           True
4     A     B   1.0           True
5     B     C   1.0           True
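One thing to keep in mind: `corrwith` pairs columns by label rather than by position, which would explain why every value in the sample above is 1.0 (B is compared with B, and C with C). If the goal is one row per X/Y pair regardless of column names, a rough sketch could look like the following. The function name `pairwise_pearson`, the `p_value` column and the fruit data are made up for illustration; the `is_correlated` rule simply mirrors the comparison used in the question.

```python
import pandas as pd
from scipy.stats import pearsonr


def pairwise_pearson(X, Y, threshold=0.05):
    """Return one row per (X column, Y column) pair with Pearson r and p-value."""
    rows = []
    for col_x in X.columns:
        for col_y in Y.columns:
            r, p_value = pearsonr(X[col_x], Y[col_y])
            rows.append({
                "col_X": col_x,
                "col_Y": col_y,
                "corr": r,
                "p_value": p_value,
                # Same rule as in the question: flagged when r exceeds the threshold
                "is_correlated": r > threshold,
            })
    return pd.DataFrame(
        rows, columns=["col_X", "col_Y", "corr", "p_value", "is_correlated"]
    )


# Made-up data, for illustration only
X = pd.DataFrame({"apple": [1, 2, 3, 4, 5], "banana": [5, 3, 4, 1, 2]})
Y = pd.DataFrame({"orange": [2, 4, 6, 8, 10]})

result = pairwise_pearson(X, Y)
print(result)
```

Collecting plain dicts and building the dataframe once at the end avoids growing a dataframe inside the loop.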
Upvotes: 0
Reputation: 1869
I would avoid using two for loops. Depending on the size of your dataset, this can be very slow.
Pandas provides a correlation function which might come in handy here:
import pandas as pd
df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})
Using corr() gives you the pairwise correlations and returns them as a new dataframe:
df.corr()
For more information, check the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
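To apply this to two separate dataframes, one possible approach is to put X and Y side by side, call corr() once, and keep only the block with X columns as rows and Y columns as columns. This is only a sketch: it assumes X and Y share the same row index and have no column names in common, and the fruit data is made up for illustration.

```python
import pandas as pd

# Made-up data, for illustration only
X = pd.DataFrame({"apple": [1, 2, 3, 4, 5], "banana": [5, 3, 4, 1, 2]})
Y = pd.DataFrame({"orange": [2, 4, 6, 8, 10], "kiwi": [1, 0, 1, 0, 1]})

# Correlate all columns at once, then keep only the X-rows / Y-columns block
full = pd.concat([X, Y], axis=1).corr(method="pearson")
cross = full.loc[X.columns, Y.columns]

# Reshape the matrix into one row per (X column, Y column) pair
long_form = (
    cross.rename_axis("col_X")
    .reset_index()
    .melt(id_vars="col_X", var_name="col_Y", value_name="corr")
)
print(cross)
print(long_form)
```

Note that corr() also computes the X-with-X and Y-with-Y correlations, which are thrown away here, so for very wide dataframes this does more work than strictly needed.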
Upvotes: 2
Reputation: 1159
You can just do the following.
df = pd.DataFrame(index=X.columns, columns=Y.columns)
# In your loop (use .loc; chained assignment like df[colY][colX] may not write to df)
df.loc[colX, colY] = corr
Your loop would then be:
for colY in Y.columns:
    for colX in X.columns:
        corr, _ = pearsonr(numerical_cols_target[colX], numerical_cols_target[colY])
        alpha = 0.05
        print('Pearson Correlation', (alpha, corr))
        # Store the coefficient: rows are X columns, columns are Y columns
        df.loc[colX, colY] = corr
        if corr <= alpha:
            print(colX + ' and ' + colY + ' two variables are not correlated ')
        else:
            print(colX + ' and ' + colY + ' two variables are highly correlated ')
        print('\n')
    print('\n')
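If only the matrix is needed and not the printed messages, the same X-by-Y layout can probably be built without an explicit double loop, because `DataFrame.corrwith` also accepts a single Series. The sketch below uses made-up fruit data; `corr_matrix` and `is_correlated` are hypothetical names, and the threshold comparison mirrors the one in the question.

```python
import pandas as pd

# Made-up data, for illustration only
X = pd.DataFrame({"apple": [1, 2, 3, 4, 5], "banana": [5, 3, 4, 1, 2]})
Y = pd.DataFrame({"orange": [2, 4, 6, 8, 10], "kiwi": [1, 0, 1, 0, 1]})
alpha = 0.05  # threshold taken from the question

# One column per Y variable; each holds the correlation of every X column with it
corr_matrix = pd.DataFrame({col_y: X.corrwith(Y[col_y]) for col_y in Y.columns})

# Boolean frame with the same shape, using the question's comparison
is_correlated = corr_matrix > alpha

print(corr_matrix)
print(is_correlated)
```

The result has float dtype, whereas a frame created empty and filled cell by cell starts out with object dtype.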
Upvotes: 0