user131983
user131983

Reputation: 3927

How to select rows in a DataFrame between two values

I am trying to modify a DataFrame df to only contain rows for which the values in the column closing_price are between 99 and 101 and trying to do this with the code below.

However, I get the error

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

and I am wondering if there is a way to do this without using loops.

df = df[99 <= df['closing_price'] <= 101]

Upvotes: 183

Views: 356734

Answers (7)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210812

there is a nicer alternative - use query() method:

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})

In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96

In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

UPDATE: answering the comment (edited to fix minor mistakes):

I like the syntax here but fell down when trying to combine with expression:

df.query('(mean - 2*sd) <= closing_price <= (mean + 2*sd)')

My data is all within 2 standard deviations of the mean, so I'll do 1 instead to demonstrate:

In [161]: qry = ("(closing_price.mean() - closing_price.std())" +
     ...:        " <= closing_price <= " +
     ...:        "(closing_price.mean() + closing_price.std())")
     ...:

In [162]: df.query(qry)
Out[162]:
   closing_price
1             99
2             98
5            101
6            101
7             99
9             96

or

In [163]: mean = df['closing_price'].mean()
     ...: sd = df['closing_price'].std()
     ...: df.query('(@mean - @sd) <= closing_price <= (@mean + @sd)')
     ...:
Out [163]:
   closing_price
1             99
2             98
5            101
6            101
7             99
9             96

Upvotes: 39

Rushabh Agarwal
Rushabh Agarwal

Reputation: 401

Instead of this

df = df[99 <= df['closing_price'] <= 101]

You should use this

df = df[(99 <= df['closing_price']) & (df['closing_price'] <= 101)]

We have to use NumPy's bitwise logic operators |, &, ~, ^ for compounding queries. Also, the parentheses are important for operator precedence.

For more info, you can visit the link: Comparisons, Masks, and Boolean Logic (excerpt from the Python Data Science Handbook by Jake VanderPlas).

Upvotes: 6

Parfait
Parfait

Reputation: 107567

Consider Series.between:

df = df[df['closing_price'].between(99, 101)]

Upvotes: 362

normanius
normanius

Reputation: 9762

If one has to call pd.Series.between(l,r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In this case, it's beneficial to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.

def between_indices(x, lower, upper, inclusive=True):
    """
    Returns smallest and largest index i for which holds 
    lower <= x[i] <= upper, under the assumption that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Sort x once before repeated calls of between()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True) # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...

Benchmark

Measure repeated evaluations (n_reps=100) of pd.Series.between() as well as the method based on pd.Series.searchsorted(), for different arguments lower and upper. On my MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the below code results in the following outpu

# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    if True:
        return x.iloc[i:j]
    else:
        # Mask creation is slow.
        mask = np.zeros_like(x, dtype=bool)
        mask[i:j] = True
        mask = pd.Series(mask, index=x.index)
        return x[mask]

def between(x, lower, upper, inclusive=True):
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]

def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x>=lower) & (x<=upper)
    else:
        mask = (x>lower) & (x<upper)
    return x[mask]

def benchmark(func, x, lowers, uppers):
    for l,u in zip(lowers, uppers):
        func(x,lower=l,upper=u)

n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)

# Assert equivalence of different methods.
assert(between_fast(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_expr(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_fast(x, 0, 1, False).equals(between(x, 0, 1, False)))
assert(between_expr(x, 0, 1, False).equals(between(x, 0, 1, False)))

# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)

Upvotes: 6

sparrow
sparrow

Reputation: 11460

If you're dealing with multiple values and multiple inputs you could also set up an apply function like this. In this case filtering a dataframe for GPS locations that fall withing certain ranges.

def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False


df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]

Upvotes: 5

crashMOGWAI
crashMOGWAI

Reputation: 692

newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

or

mean = closing_price.mean()
std = closing_price.std()

newdf = df.query('@mean <= closing_price <= @std')

Upvotes: 12

Jianxun Li
Jianxun Li

Reputation: 24742

You should use () to group your boolean vector to remove ambiguity.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]

Upvotes: 171

Related Questions