Reputation: 63
The pandas .corr() function allows me to obtain correlation coefficients between features. I am searching for an efficient way of calculating the correlation coefficients when multiple conditions have to be satisfied.
In my case, I have a dataframe in which each row corresponds to a certain area that has been isolated with a fence. Columns in the dataframe are area extent and fence materials (simplified example below). I'd like to calculate the correlation matrix for areas and each material whenever its value is not zero. For instance, with
df = pd.DataFrame({'Area': [0.5, 4, 2, 1], 'Wire_rows': [3, 9, 5, 0], 'Columns': [4, 16, 0, 5]})
then df.corr().loc['Area', :]
gives me the correlation between area and 'Wire_rows' and 'Columns'. If I want to have this calculation excluding zero values for the materials I'd have to write something such as
df[df['Wire_rows'] > 0].corr().loc['Area', 'Wire_rows']
df[df['Columns'] > 0].corr().loc['Area', 'Columns']
Obtaining a full correlation matrix would then require merging these individual results.
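To make the merging step concrete, here is a minimal sketch of the per-column approach, collecting one filtered coefficient per material into a single Series. The names `materials` and `area_corr` are only illustrative, and the sketch assumes every column other than 'Area' is a material column.

```python
import pandas as pd

df = pd.DataFrame({'Area': [0.5, 4, 2, 1],
                   'Wire_rows': [3, 9, 5, 0],
                   'Columns': [4, 16, 0, 5]})

# Assumption: every column except 'Area' is a material column
materials = [c for c in df.columns if c != 'Area']

# For each material, keep only the rows where it is non-zero,
# then correlate Area with that material on those rows
area_corr = pd.Series(
    {m: df.loc[df[m] > 0, 'Area'].corr(df.loc[df[m] > 0, m]) for m in materials},
    name='Area',
)
print(area_corr)
```

This gives the 'Area' row only; a full matrix would need the same filtering for every pair of columns, which is what makes this route tedious.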
In my real example there are over 15 material columns and several rows, so I wonder if there is a better way of excluding zero values from the individual calculations.
Upvotes: 0
Views: 1119
Reputation: 891
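Does this help? It replaces the zeros in the material columns with NaN. .corr() excludes NaN values pairwise, so each coefficient is computed only from rows where both columns are non-zero.
import numpy as np
cols = ['Wire_rows', 'Columns']
d = {}
for col in cols:
    d[col] = {0: np.nan}  # per-column mapping: 0 -> NaN
df.replace(d).corr().loc['Area']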
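As a rough check, the sketch below applies the same zero-to-NaN idea with `DataFrame.mask` instead of `replace` and compares the 'Area' row against the row-filtering approach from the question. The names `materials`, `masked` and `area_row` are illustrative, and it again assumes that every column except 'Area' is a material column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Area': [0.5, 4, 2, 1],
                   'Wire_rows': [3, 9, 5, 0],
                   'Columns': [4, 16, 0, 5]})

# Assumption: every column except 'Area' is a material column
materials = [c for c in df.columns if c != 'Area']

# Turn zeros in the material columns into NaN; Area is left untouched
masked = df.copy()
masked[materials] = masked[materials].mask(masked[materials] == 0)

# corr() drops NaN pairwise, so each entry uses only rows
# where both columns of that pair are present
corr = masked.corr()
area_row = corr.loc['Area']

# Cross-check against filtering the rows one material at a time
for m in materials:
    expected = df[df[m] > 0].corr().loc['Area', m]
    assert np.isclose(area_row[m], expected)

print(area_row)
```

Note that the entries between two materials are computed only from rows where both are non-zero, which can leave very few points (two rows in this toy data). Passing `min_periods` to `corr()` is one way to get NaN instead of a coefficient when a pair has too few observations.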
Upvotes: 2