BhishanPoudel
BhishanPoudel

Reputation: 17164

Python pandas: how to select the rows when sign changes and have miminum of the values?

I was trying to find where the function crosses the line x=0. I utilized the fact that when the function crosses x-axis, its sign changes.

Now, I have a dataframe like this, I want to find the TWO rows that are closest to zeros, given that function crosses the x-axis at two points.

     A     value
0  105  0.662932
1  105  0.662932
2  107  0.052653 # sign changes here when A is 107
3  108 -0.228060 # among these two A 107 is closer to zero
4  110 -0.740819
5  112 -1.188906
6  142 -0.228060 # sign changes here when A is 142
7  143  0.052654 # among these two, A 143 is closer to zero
8  144  0.349638

Required output:

     A     value
2  107  0.052653 
7  143  0.052654 

Upvotes: 1

Views: 2454

Answers (3)

BhishanPoudel
BhishanPoudel

Reputation: 17164

I managed get a simple solution:

import numpy as np
import pandas as pd

data = [
    [105,  0.662932],
    [105,  0.662932],
    [107,  0.052653], # sign changes between here
    [108, -0.228060], # and here; first row has `value` closer to 0
    [110, -0.740819],
    [112, -1.188906],
    [142, -0.228060], # sign changes between here
    [143,  0.052654], # and here; second row has `value` closer to 0
    [144,  0.349638],
]

df = pd.DataFrame(data, columns=["A", "value"]

Solution

def find_closest_to_zero_idx(arr):
    fx = np.zeros(len(arr))
    fy = np.array(arr)

    # lower index when sign changes in array
    idx = np.argwhere((np.diff(np.sign(fx - fy)) != 0) )
    nearest_to_zero = []

    # test two values before and after zero which is nearer to zero
    for i in range(len(idx)):
        if abs(arr[idx[i][0]]) < abs(arr[idx[i][0]+1]):
            nearer = idx[i][0]
            nearest_to_zero.append(nearer)
        else:
            nearer = idx[i][0]+1
            nearest_to_zero.append(nearer)


    return nearest_to_zero

idx = find_closest_to_zero_idx(df.value.values)

Result

idx = find_closest_to_zero_idx(df.value.values)

df.loc[idx]

     A     value
2  107  0.052653
7  143  0.052654

Slow but pure pandas method

df['value_shifted'] = df.value.shift(-1)
df['sign_changed'] = np.sign(df.value.values) * np.sign(df.value_shifted.values)

# lower index where sign changes
idx = df[df.sign_changed == -1.0].index.values

# make both lower and upper index from the a-axis negative so that
# we can groupby later.
for i in range(len(idx)):
    df.loc[ [idx[i], idx[i]+1], 'sign_changed'] = -1.0 * (i+1)

df1 = df[ np.sign(df.sign_changed) == -1.0]
df2 = df1.groupby('sign_changed')['value'].apply(lambda x: min(abs(x)))
df3 = df2.reset_index()

answer = df.merge(df3,on=['sign_changed','value'])
answer
     A     value  value_shifted  sign_changed
0  107  0.052653      -0.228060          -1.0
1  143  0.052654       0.349638          -2.0

Upvotes: 1

Nathan
Nathan

Reputation: 10306

import pandas as pd

data = [
    [105,  0.662932],
    [105,  0.662932],
    [107,  0.052653], # sign changes between here
    [108, -0.228060], # and here; first row has `value` closer to 0
    [110, -0.740819],
    [112, -1.188906],
    [142, -0.228060], # sign changes between here
    [143,  0.052654], # and here; second row has `value` closer to 0
    [144,  0.349638],
]

df = pd.DataFrame(data, columns=["A", "value"])

# where the sign is the same between two elements, the diff is 0
# otherwise, it's either 2 or -2 (doesn't matter which for this use case)
# use periods=1 and =-1 to do a diff forwards and backwards

sign = df.value.map(np.sign)
diff1 = sign.diff(periods=1).fillna(0)
diff2 = sign.diff(periods=-1).fillna(0)

# now we have the locations where sign changes occur. We just need to extract
# the `value` values at those locations to determine which of the two possibilities
# to choose for each sign change (whichever has `value` closer to 0)

df1 = df.loc[diff1[diff1 != 0].index]
df2 = df.loc[diff2[diff2 != 0].index]
idx = np.where(abs(df1.value.values) < abs(df2.value.values), df1.index.values, df2.index.values)
df.loc[idx]
    A   value
2   107 0.052653
7   143 0.052654

Thanks to @Vince W. for mentioning one should use np.where; I was going with a more convoluted approach initially.

EDIT - see @useruser3483203's answer below which is much faster than this. You can improve even a bit more (2x as fast when I reran their timings) by doing the first few operations (diff, abs, compare equality) on a numpy array instead of a pandas Series too. numpy's diff is different than the one in pandas, though, in that it drops the first element instead of returning NaN for it. This means we get back the index of the first row of the sign change, instead of the second, and need to add one to get the next row.

def find_min_sign_changes(df):
    vals = df.value.values
    abs_sign_diff = np.abs(np.diff(np.sign(vals)))
    # idx of first row where the change is
    change_idx = np.flatnonzero(abs_sign_diff == 2)
    # +1 to get idx of second rows in the sign change too
    change_idx = np.stack((change_idx, change_idx + 1), axis=1)

    # now we have the locations where sign changes occur. We just need to extract
    # the `value` values at those locations to determine which of the two possibilities
    # to choose for each sign change (whichever has `value` closer to 0)

    min_idx = np.abs(vals[change_idx]).argmin(1)
    return df.iloc[change_idx[range(len(change_idx)), min_idx]]

Upvotes: 3

user3483203
user3483203

Reputation: 51155

You can generalize the approach using numpy:

a = df.value.values
u = np.sign(df.value)
m = np.flatnonzero(u.diff().abs().eq(2))

g = np.stack([m-1, m], axis=1)
v = np.abs(a[g]).argmin(1)

df.iloc[g[np.arange(g.shape[0]), v]]

     A     value
2  107  0.052653
7  143  0.052654

This solution is also going to be a lot more efficient, especially as the size scales.

In [122]: df = pd.concat([df]*100)

In [123]: %timeit chris(df)
870 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [124]: %timeit nathan(df)
2.03 s ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [125]: %timeit df.loc[find_closest_to_zero_idx(df.value.values)]
1.81 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 3

Related Questions