Reputation: 31
I'm trying to determine how many days in a row a certain column is above or below a certain threshold.
>>> df.head()
Open High Low Close Volume
Date
2004-08-19 49.96 51.98 47.93 50.12 NaN
2004-08-20 50.69 54.49 50.20 54.10 NaN
2004-08-23 55.32 56.68 54.47 54.65 NaN
2004-08-24 55.56 55.74 51.73 52.38 NaN
2004-08-25 52.43 53.95 51.89 52.95 NaN
>>>
For the example above, I'd like another column df['RDA'] to increment for each consecutive day that the column Open is over 50. For each consecutive day below 50, I'd like a second column df['RDB'] to increment and df['RDA'] to reset to 0. I've tried if/then logic, but it raises a value error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
How can I sort this out?
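Roughly, the kind of thing I tried looks like this (a sketch of the failing pattern; the condition is a whole boolean Series, which is what raises the error):
if df['Open'] > 50:    # df['Open'] > 50 is a Series of booleans, not a single True/False
    df['RDA'] += 1
else:
    df['RDB'] += 1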
I'd like the output of my dataframe to look like this:
>>> df.head()
Open High Low Close Volume RDA RDB
Date
2004-08-19 51.96 51.98 47.93 50.12 NaN 1 0
2004-08-20 50.69 54.49 50.20 54.10 NaN 2 0
2004-08-23 55.32 56.68 54.47 54.65 NaN 3 0
2004-08-24 45.56 55.74 51.73 52.38 NaN 0 1
2004-08-25 42.43 53.95 51.89 52.95 NaN 0 2
2004-08-26 41.96 51.98 47.93 50.12 NaN 0 3
2004-08-27 40.69 54.49 50.20 54.10 NaN 0 4
2004-08-28 55.32 56.68 54.47 54.65 NaN 1 0
2004-08-29 55.56 55.74 51.73 52.38 NaN 2 0
2004-08-30 52.43 53.95 51.89 52.95 NaN 3 0
>>>
Is this something that's possible with Pandas? I know you can get a count of the values in a column, but I've so far been unable to find a method for counting consecutive values. An if/then statement with 2 variables would work, but as I mentioned above, I get a value error when I try that. Any help would be appreciated.
Upvotes: 2
Views: 814
Reputation: 294488
Use np.sign on the difference between Open and 50. It will be -1 when less than 50, 0 when exactly 50, and 1 when greater than 50. Then:
- compare adjacent signs (an np.diff-style check) to identify when the sign switches from one value to another
- cumsum to define the groups of consecutive signs
- cumcount to get counts within groups
- np.where to split the cumcounts into RDA and RDB
import numpy as np

o = df.Open.values - 50                              # distance from the threshold
signs = np.sign(o)                                   # -1 below 50, 0 at 50, 1 above 50
changes = np.append(False, signs[:-1] != signs[1:])  # True wherever the sign switches
g = changes.cumsum()                                 # group id for each run of equal signs
cumcounts = df.groupby(g).cumcount() + 1             # 1-based position within each run
a = np.where(signs == 1, cumcounts, 0)               # counts only on days above 50
b = np.where(signs == -1, cumcounts, 0)              # counts only on days below 50
df.assign(RDA=a, RDB=b)
Open High Low Close Volume RDA RDB
Date
2004-08-19 51.96 51.98 47.93 50.12 NaN 1 0
2004-08-20 50.69 54.49 50.20 54.10 NaN 2 0
2004-08-23 55.32 56.68 54.47 54.65 NaN 3 0
2004-08-24 45.56 55.74 51.73 52.38 NaN 0 1
2004-08-25 42.43 53.95 51.89 52.95 NaN 0 2
2004-08-26 41.96 51.98 47.93 50.12 NaN 0 3
2004-08-27 40.69 54.49 50.20 54.10 NaN 0 4
2004-08-28 55.32 56.68 54.47 54.65 NaN 1 0
2004-08-29 55.56 55.74 51.73 52.38 NaN 2 0
2004-08-30 52.43 53.95 51.89 52.95 NaN 3 0
Upvotes: 2
Reputation: 653
This can also be done using Python's built-in functools.reduce. First create an iterable of your target data, so in your case:
target = df.Open > 50
This is what you pass into functools.reduce later on to be "reduced". Reduce is essentially map, but it retains a value across list elements. This can be leveraged to do what you are asking.
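For example, a toy illustration of that carried value (unrelated to the dataframe itself):
>>> import functools
>>> functools.reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)  # running total carried across elements
10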
I'll try and break down the function you could use (shown in full at the end of the post).
functools.reduce gives you access to two parameters: your accumulated value, and the current list item you are at. It also allows you to pass in your own initializer (the first item, before anything is looked at). With this we can go through our list and, if the current item is True (as determined by our target series above), append the last element of the accumulator plus 1; otherwise append a 0.
This requires a little finessing: set the initializer to a list containing the value 0, i.e. [0], so that on the first pass reduce can take the "last" element and do something with it without erring out.
Once completed, there will be that straggler 0 at the front of the list, which you can remove with a slice ([1:]) to take only the second element onwards.
Your RDB column is exactly the same, except you want to count when the value is NOT True in your target list, which just requires adding a not to the conditional statement.
The full code would look like this:
import functools
# Create a boolean series of your Open column
target = df.Open > 50
# For every item in your boolean series add a 1 to the previous value if it's over 50, otherwise reset
df['RDA'] = functools.reduce(lambda x, y: x + ([x[-1] + 1] if y else [0]), target, [0])[1:]
# Repeat, but for every `False` value in the series
df['RDB'] = functools.reduce(lambda x, y: x + ([x[-1] + 1] if not y else [0]), target, [0])[1:]
>>> df.head()
Open High Low Close Volume RDA RDB
Date
2004-08-19 49.96 51.98 47.93 50.12 NaN 0 1
2004-08-20 50.69 54.49 50.20 54.10 NaN 1 0
2004-08-23 55.32 56.68 54.47 54.65 NaN 2 0
2004-08-24 55.56 55.74 51.73 52.38 NaN 3 0
2004-08-25 52.43 53.95 51.89 52.95 NaN 4 0
Upvotes: 1
Reputation: 109626
First, add a flag column to the dataframe to indicate whether Open is higher than the target price of 50 (True or False).
Then you can use the compare-cumsum-groupby pattern to identify the cumulative groupings of this flag and apply a cumsum to each such group.
We then invert the flag, so that ones are zeroes and zeroes are ones, and use the same strategy to calculate RDB.
Finally, we drop the flag column (I used .iloc[:, :-1] to drop it, since I added it as the last column) and append the new RDA and RDB columns.
target_price = 50
df = df.assign(flag=df.Open.gt(target_price)) # True if `Open` greater than `target_price`, otherwise False.
rda = df.groupby((df['flag'] != df['flag'].shift()).cumsum()).flag.cumsum()
df['flag'] = ~df['flag'] # Invert flag for RDB.
rdb = df.groupby((df['flag'] != df['flag'].shift()).cumsum()).flag.cumsum()
df = df.iloc[:, :-1].assign(RDA=rda, RDB=rdb)
>>> df
Date Open High Low Close Volume RDA RDB
0 8/19/04 51.96 51.98 47.93 50.12 NaN 1 0
1 8/20/04 50.69 54.49 50.20 54.10 NaN 2 0
2 8/23/04 55.32 56.68 54.47 54.65 NaN 3 0
3 8/24/04 45.56 55.74 51.73 52.38 NaN 0 1
4 8/25/04 42.43 53.95 51.89 52.95 NaN 0 2
5 8/26/04 41.96 51.98 47.93 50.12 NaN 0 3
6 8/27/04 40.69 54.49 50.20 54.10 NaN 0 4
7 8/28/04 55.32 56.68 54.47 54.65 NaN 1 0
8 8/29/04 55.56 55.74 51.73 52.38 NaN 2 0
9 8/30/04 52.43 53.95 51.89 52.95 NaN 3 0
Upvotes: 2
Reputation: 5193
I originally misunderstood and didn't realize you wanted to restart the count whenever it changed. I think the best way to do this would be to use DataFrame.apply across the rows, like this:
In [224]: from itertools import count
In [225]: rda, rdb = count(1), count(1)  # assumed initialization (not shown originally); needed before apply
In [226]: def increment(row):
     ...:     global rda
     ...:     global rdb
     ...:     if row.Open > 50:
     ...:         row.RDA = next(rda)
     ...:         rdb = count()  # reset the below-50 counter
     ...:     else:
     ...:         row.RDB = next(rdb)
     ...:         rda = count()  # reset the above-50 counter
     ...:     return row
In [227]: df['RDA'] = 0
In [228]: df['RDB'] = 0
In [229]: df.apply(increment, axis=1)
Open High Low Close Volume RDA RDB
Date
2004-08-19 49.96 51.98 47.93 50.12 NaN 0.0 1.0
2004-08-20 50.69 54.49 50.20 54.10 NaN 0.0 0.0
2004-08-23 55.32 56.68 54.47 54.65 NaN 1.0 0.0
2004-08-24 55.56 55.74 51.73 52.38 NaN 2.0 0.0
2004-08-25 52.43 53.95 51.89 52.95 NaN 3.0 0.0
I don't know why they're showing up as floats in the columns; I guess pandas thinks that's what you want. The data originally comes from count as an int. And I'm not normally a fan of globals, but DataFrame.apply couldn't access the variables when they were outside the increment function.
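If the floats are unwanted, casting afterwards should restore integers (a small sketch; result is a hypothetical name for the frame returned by apply, and the cast assumes those columns hold no NaN):
result = df.apply(increment, axis=1)
result[['RDA', 'RDB']] = result[['RDA', 'RDB']].astype(int)  # cast the float counter columns back to int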
Upvotes: 0
Reputation: 1191
def fun(d):
    j = 0
    flag = 2  # 2 = no run yet, 0 = currently above 50, 1 = currently below 50
    rda = []
    rdb = []
    for i in range(d.shape[0]):
        if d['Open'].iloc[i] >= 50:  # positional lookup, so this also works with a Date index
            if flag == 1:  # previous day was below 50: restart the count
                j = 0
            j = j + 1
            rda.append(j)
            rdb.append(0)
            flag = 0
        else:
            if flag == 0:  # previous day was above 50: restart the count
                j = 0
            j = j + 1
            rdb.append(j)
            rda.append(0)
            flag = 1
    return rda, rdb

df['RDA'], df['RDB'] = fun(df)
Upvotes: 0