Neil
Neil

Reputation: 8247

correlation in pandas with groupby

I have following dataframe in pandas

 code    tank     var     nozzle_1    nozzle_2     nozzle_3    nozzle_tank
 123     1        23.34   12.23       54.56        12.22       11 
 123     1        22.32   11.32       7.89         3.45        12 
 123     1        21.22   19.93       5.54         5.66        12
 123     1        21.34   12.23       54.56        22.22       14
 123     1        32.32   13.32       4.89         32.45       34
 123     1        32.22   29.93       23.54        23.66       33
 123     2        23.34   12.23       54.56        12.22       21
 123     2        22.32   11.32       7.89         3.45        22
 123     2        21.22   19.93       5.54         5.66        21
 123     2        21.34   12.23       54.56        22.22       21
 123     2        32.32   13.32       4.89         32.45       22
 123     2        32.22   29.93       23.54        23.66       21  

I want to calculate correlation of nozzle_1, nozzle_2, nozzle_3 and nozzle_4 with var column grouping over tank

My desired dataframe would be

 code   tank    nozzle_1    nozzle_2    nozzle_3    nozzle_4    
 123    1       0.08        0.01        0.02        0.01
 123    2       0.07        0.01        0.02        0.02

I am doing following in pandas

cols= df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)] 
cols= np.array(cols)
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
        final_df = pd.DataFrame()
        for col in nozzles_to_scale:
            corrs = (df[[col, tank]].groupby(tank).corrwith(df.var_col ).reset_index())
            final_df = final_df.join(corrs)
        return final_df

But it does not seem to work, how can we do it in pandas?

    test =  corrVar(df, cols, var, tank)

Upvotes: 1

Views: 361

Answers (1)

jezrael
jezrael

Reputation: 862451

You can use:

cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)] 
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
        final_df = [df[[col, tank]].groupby(tank).corrwith(df[var_col]) for col in cols]
        return pd.concat(final_df, axis=1)

print (corrVar(df, cols, var_col, tank))
      nozzle_1  nozzle_2  nozzle_3
tank                              
1     0.501164 -0.309435  0.761017
2     0.501164 -0.309435  0.761017

EDIT: Solution for correlation of each N values per groups:

N = 3
g = df.groupby('tank').cumcount() // N

cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)] 
var_col = 'var'
tank = 'tank'
code = 'code'
def corrVar(df, cols, var_col, tank, g):
        #https://stackoverflow.com/a/48570300
        final_df = [df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col])) 
                            for col in cols]
        return pd.concat(final_df, axis=1, keys=cols)

print (corrVar(df, cols, var_col, tank, g))
        nozzle_1  nozzle_2  nozzle_3
  tank                              
0 1    -0.826376  0.876202  0.703793
  2    -0.826376  0.876202  0.703793
1 1     0.540176 -0.931286  0.614626
  2     0.540176 -0.931286  0.614626

Testing groups:

print (df.assign(groups=g))
    code  tank    var  nozzle_1  nozzle_2  nozzle_3  nozzle_tank  groups
0    123     1  23.34     12.23     54.56     12.22           11       0
1    123     1  22.32     11.32      7.89      3.45           12       0
2    123     1  21.22     19.93      5.54      5.66           12       0
3    123     1  21.34     12.23     54.56     22.22           14       1
4    123     1  32.32     13.32      4.89     32.45           34       1
5    123     1  32.22     29.93     23.54     23.66           33       1
6    123     2  23.34     12.23     54.56     12.22           21       0
7    123     2  22.32     11.32      7.89      3.45           22       0
8    123     2  21.22     19.93      5.54      5.66           21       0
9    123     2  21.34     12.23     54.56     22.22           21       1
10   123     2  32.32     13.32      4.89     32.45           22       1
11   123     2  32.22     29.93     23.54     23.66           21       1

EDIT:

Function should be one row:

def corrVar(df, cols, var_col, tank, g):
        return pd.concat([df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col])) 
                          for col in cols], axis=1, keys=cols)

Upvotes: 1

Related Questions