David A

Reputation: 65

Are intermediate results piped using _ in chained operations available to subsequent functions in the chain?

I am creating a correlation matrix from which I want to obtain the maximum positive correlation value. Applying max() to the corr() results just returns 1.0, the self-correlation values along the diagonal, which is not what I want. The objective is therefore to remove all occurrences of 1.0 and then run max(). I was thinking of doing this in a chained operation, and I can use _ to pipe intermediate results to the where() operation, which does turn the 1.0 values into NaNs. However, applying max() as the next operation in the chain still returns 1.0, as though it is ignoring the results of the where().

Is there something I'm not understanding with the _ operator? Or perhaps where() is the wrong function in this context? I have provided full code below to reproduce the question.

# Set up the problem

import pandas as pd
import numpy as np

# raw data

raw_t = [
66.6, 36.4, 47.6, 17.0, 54.6, 21.0, 12.2, 13.6, 20.6, 55.4, 63.4, 69.0,
80.2, 26.2, 42.6, 31.8, 15.6, 27.8, 13.8, 22.0, 14.2, 62.6, 96.4, 113.8,
115.2,82.2, 65.0, 23.2, 24.0, 14.2,  1.4,  3.8, 16.4, 16.4, 67.0, 51.4
]

# raw indexes

yr_mn = (np.full(12, 2000).tolist() + np.full(12, 2001).tolist() + np.full(12, 2002).tolist(),
np.arange(1,13).tolist() + np.arange(1,13).tolist() + np.arange(1,13).tolist() )

# structure multi index

index_base = list(zip(*yr_mn))
index = pd.MultiIndex.from_tuples(index_base, names=["year", "month"])

# create indexed dataset

t_dat = pd.Series(raw_t, index=index)

# example of the correlation matrix we are working with

pd.set_option("format.precision", 2)
t_dat.unstack().corr().style.background_gradient(cmap="YlGnBu")

And my attempts:


t_dat.unstack().corr().stack().where(_!=1.0) # does swap out 1.0 for NaN  
t_dat.unstack().corr().stack().where(_!=1.0).max() # still returns 1.0

Another point is that it sometimes works and sometimes fails with ValueError: Array conditional must be same shape as self

This also makes me suspect that I am missing something. The default behaviour of pandas' max() is to skip NaNs, so it shouldn't have anything to do with that. I also tried setting the 1.0 values to 0.0 using where(_!=1.0,0.0), with the same result. I also found that the ValueError goes away if I comment out the where() and rerun, as:


t_dat.unstack().corr().stack()#.where(_!=1.0)

This somehow resets it, even though the original dataframe is not being altered.

Thanks for any insights! David

Upvotes: 1

Views: 49

Answers (1)

Andrej Kesely

Reputation: 195573

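Don't use _ in interactive environments. It is not a pipe operator: it holds the result of the last command that produced output, so its value changes every time you run something. It may appear to work, but it will eventually break.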
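The symptoms in the question look consistent with _ holding a different object on each run. The sketch below simulates that in a plain script, assuming an ordinary variable named last_output (a hypothetical stand-in for the shell's _) and a small illustrative DataFrame rather than the asker's data. It is one plausible reading of what happened, not a confirmed trace of the asker's session.

```python
import pandas as pd

# Illustrative data (not the asker's): three short columns.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
stacked = df.corr().stack()

# `last_output` is a hypothetical stand-in for the interactive shell's `_`.

# Case 1: `_` happens to hold the stacked Series, so the mask is correct
# and the diagonal values become NaN.
last_output = stacked
masked = stacked.where(last_output != 1.0)

# Case 2: `_` now holds the masked result from case 1. NaN != 1.0 is True,
# so the condition is True everywhere, nothing is masked, and max() sees 1.0.
last_output = masked
second_max = stacked.where(last_output != 1.0).max()

# Case 3: `_` now holds the scalar from case 2. The condition collapses to a
# single boolean, which where() rejects because it is not the Series' shape.
last_output = second_max
try:
    stacked.where(last_output != 1.0)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(masked.isna().sum(), second_max, error_message)
```

Under this reading, rerunning the line with where() commented out "resets" things only because it makes _ the plain stacked Series again, which is case 1.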

You can do this:

# store the result to a variable:
result = t_dat.unstack().corr().stack()

# compute the boolean mask and set the True values to NaN
mask = result == 1.0
result[mask] = np.nan

print(result)

Prints:


...
11     1       -0.148800
       2       -0.561202
       3       -0.595797
       4        0.945831
       5       -0.737437
       6        0.812018
       7        0.516614
       8        0.785324
       9       -0.823919
       10       0.539078
       11            NaN
       12       0.929903
12     1       -0.502081
       2       -0.826288
       3       -0.849431
       4        0.760119
       5       -0.437322
       6        0.969761
       7        0.795323
       8        0.957978
       9       -0.557725
       10       0.811077
       11       0.929903
       12            NaN
dtype: float64

Then you can compute the max:

print(result.max())

Prints:

0.9996502197746994
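If a single chained expression is still preferred, where() also accepts a callable, which receives the object it is called on, so no reference to _ is needed. The following is a minimal sketch on a small illustrative DataFrame (an assumption, not the asker's data). The second variant, which drops only the diagonal by comparing index labels, is an alternative that is not part of the answer above; it avoids discarding any off-diagonal pair whose correlation happens to be exactly 1.0.

```python
import pandas as pd

# Illustrative data (not the asker's): three short columns.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Chained version: the lambda receives the stacked Series itself.
max_corr = df.corr().stack().where(lambda s: s != 1.0).max()

# Alternative: keep only entries whose row label differs from the column label.
stacked = df.corr().stack()
row_labels = stacked.index.get_level_values(0)
col_labels = stacked.index.get_level_values(1)
max_off_diag = stacked[row_labels != col_labels].max()

print(max_corr, max_off_diag)
```

With the asker's data, the same chain would read t_dat.unstack().corr().stack().where(lambda s: s != 1.0).max().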

Upvotes: 1
