Reputation: 8247
I have following dataframe in pandas
code tank var nozzle_1 nozzle_2 nozzle_3 nozzle_tank
123 1 23.34 12.23 54.56 12.22 11
123 1 22.32 11.32 7.89 3.45 12
123 1 21.22 19.93 5.54 5.66 12
123 1 21.34 12.23 54.56 22.22 14
123 1 32.32 13.32 4.89 32.45 34
123 1 32.22 29.93 23.54 23.66 33
123 2 23.34 12.23 54.56 12.22 21
123 2 22.32 11.32 7.89 3.45 22
123 2 21.22 19.93 5.54 5.66 21
123 2 21.34 12.23 54.56 22.22 21
123 2 32.32 13.32 4.89 32.45 22
123 2 32.22 29.93 23.54 23.66 21
I want to calculate correlation of nozzle_1, nozzle_2, nozzle_3 and nozzle_4 with var column grouping over tank
My desired dataframe would be
code tank nozzle_1 nozzle_2 nozzle_3 nozzle_4
123 1 0.08 0.01 0.02 0.01
123 2 0.07 0.01 0.02 0.02
I am doing following in pandas
cols= df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
cols= np.array(cols)
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
final_df = pd.DataFrame()
for col in nozzles_to_scale:
corrs = (df[[col, tank]].groupby(tank).corrwith(df.var_col ).reset_index())
final_df = final_df.join(corrs)
return final_df
But it does not seem to work, how can we do it in pandas?
test = corrVar(df, cols, var, tank)
Upvotes: 1
Views: 361
Reputation: 862451
You can use:
cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
final_df = [df[[col, tank]].groupby(tank).corrwith(df[var_col]) for col in cols]
return pd.concat(final_df, axis=1)
print (corrVar(df, cols, var_col, tank))
nozzle_1 nozzle_2 nozzle_3
tank
1 0.501164 -0.309435 0.761017
2 0.501164 -0.309435 0.761017
EDIT: Solution for correlation of each N
values per groups:
N = 3
g = df.groupby('tank').cumcount() // N
cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
var_col = 'var'
tank = 'tank'
code = 'code'
def corrVar(df, cols, var_col, tank, g):
#https://stackoverflow.com/a/48570300
final_df = [df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col]))
for col in cols]
return pd.concat(final_df, axis=1, keys=cols)
print (corrVar(df, cols, var_col, tank, g))
nozzle_1 nozzle_2 nozzle_3
tank
0 1 -0.826376 0.876202 0.703793
2 -0.826376 0.876202 0.703793
1 1 0.540176 -0.931286 0.614626
2 0.540176 -0.931286 0.614626
Testing groups:
print (df.assign(groups=g))
code tank var nozzle_1 nozzle_2 nozzle_3 nozzle_tank groups
0 123 1 23.34 12.23 54.56 12.22 11 0
1 123 1 22.32 11.32 7.89 3.45 12 0
2 123 1 21.22 19.93 5.54 5.66 12 0
3 123 1 21.34 12.23 54.56 22.22 14 1
4 123 1 32.32 13.32 4.89 32.45 34 1
5 123 1 32.22 29.93 23.54 23.66 33 1
6 123 2 23.34 12.23 54.56 12.22 21 0
7 123 2 22.32 11.32 7.89 3.45 22 0
8 123 2 21.22 19.93 5.54 5.66 21 0
9 123 2 21.34 12.23 54.56 22.22 21 1
10 123 2 32.32 13.32 4.89 32.45 22 1
11 123 2 32.22 29.93 23.54 23.66 21 1
EDIT:
Function should be one row:
def corrVar(df, cols, var_col, tank, g):
return pd.concat([df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col]))
for col in cols], axis=1, keys=cols)
Upvotes: 1