Reputation: 6404
For each date, I'm trying to compute how often (as a percentage) past values have been higher than that date's value.
I've looked into pandas' rolling and cumcount functions, but still cannot figure out how to do this efficiently.
Below is a working code snippet of what I'm trying to achieve, with the output displayed in the image below. My gut says this is an extremely ugly and inefficient approach, so I was hoping to find something better.
import numpy as np
import pandas as pd

def freq_greater_than(r):
    # rows dated strictly before the current row
    smaller_date = df[df.date < r.date]
    larger_num = smaller_date[smaller_date.num > r.num]
    return round(len(larger_num) / len(smaller_date) * 100, 2)

index = pd.date_range('2020-01-01', '2020-01-10')
# random_integers is removed in modern NumPy; randint's upper bound is exclusive
df = pd.DataFrame(dict(num=np.random.randint(0, 101, len(index))),
                  index=pd.Series(index, name='date'))
df['date'] = df.index
# the first row has no earlier dates to compare against, so skip it
df['freq_greater_than'] = df.iloc[1:].apply(freq_greater_than, axis=1)
df.style.format({'freq_greater_than': '{:.2f}%'})
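For reference, the same per-row computation can also be written without apply at all, as a single NumPy broadcast comparison. This is a sketch on hypothetical fixed values (not the random data above), and it builds an n-by-n matrix, so it trades memory for speed:

```python
import numpy as np

num = np.array([72, 28, 47, 2, 73, 8])

# cmp[j, i] is True when the value at position j exceeds the value at position i
cmp = num[:, None] > num[None, :]

# keep only strictly earlier positions j < i (the part above the diagonal)
counts = np.triu(cmp, k=1).sum(axis=0)

pct = np.full(len(num), np.nan)
pct[1:] = 100 * counts[1:] / np.arange(1, len(num))
# pct -> [nan, 100., 50., 100., 0., 80.]
```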
Upvotes: 1
Views: 109
Reputation: 35626
Try with expanding().apply() + mean() over a gt() comparison:
df['freq_greater_than'] = (
    df['num'].expanding()
             .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).mean() * 100)
)
df:
num date freq_greater_than
date
2020-01-01 72 2020-01-01 NaN
2020-01-02 28 2020-01-02 100.000000
2020-01-03 47 2020-01-03 50.000000
2020-01-04 2 2020-01-04 100.000000
2020-01-05 73 2020-01-05 0.000000
2020-01-06 8 2020-01-06 80.000000
2020-01-07 87 2020-01-07 0.000000
2020-01-08 94 2020-01-08 0.000000
2020-01-09 40 2020-01-09 62.500000
2020-01-10 25 2020-01-10 77.777778
Optional map for string formatting:
df['freq_greater_than'] = (
    df['num'].expanding()
             .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).mean() * 100)
             .map('{:.2f}%'.format)
)
num date freq_greater_than
date
2020-01-01 72 2020-01-01 nan%
2020-01-02 28 2020-01-02 100.00%
2020-01-03 47 2020-01-03 50.00%
2020-01-04 2 2020-01-04 100.00%
2020-01-05 73 2020-01-05 0.00%
2020-01-06 8 2020-01-06 80.00%
2020-01-07 87 2020-01-07 0.00%
2020-01-08 94 2020-01-08 0.00%
2020-01-09 40 2020-01-09 62.50%
2020-01-10 25 2020-01-10 77.78%
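Note the literal nan% in the first row above: map formats the NaN like any other value. If that's undesirable, the formatter can skip missing values instead, for example (a small sketch on hypothetical values):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 100.0, 77.777778])
# leave missing values blank instead of formatting them as "nan%"
formatted = s.map(lambda v: '' if pd.isna(v) else f'{v:.2f}%')
# -> ['', '100.00%', '77.78%']
```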
DataFrame Constructor and Imports:
import pandas as pd
index = pd.date_range('2020-01-01', '2020-01-10')
df = pd.DataFrame({'num': [72, 28, 47, 2, 73, 8, 87, 94, 40, 25]},
index=pd.Series(index, name='date'))
df['date'] = df.index
Upvotes: 1
Reputation: 18296
Here is one way with expanding:
df["freq_gt"] = (df.num
                 .expanding()
                 .agg(lambda x: 100 * x.iloc[:-1].gt(x.iat[-1]).sum() / (x.size - 1)))
where expanding sends ever-growing windows over the num column and x represents the current window. We take every value except the last via iloc[:-1] and compare it (gt) against the last value (iat[-1]). Then sum counts the Trues resulting from the comparison, which gives the number of past values that are greater. Dividing by the window size minus 1 (to exclude the current value) and multiplying by 100 gives the percentage, to get
num freq_gt
date
2020-01-01 46 NaN
2020-01-02 35 100.000000
2020-01-03 80 0.000000
2020-01-04 4 100.000000
2020-01-05 43 50.000000
2020-01-06 4 80.000000
2020-01-07 7 66.666667
2020-01-08 91 0.000000
2020-01-09 95 0.000000
2020-01-10 99 0.000000
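To see exactly what expanding hands to the lambda, one can record each window as it arrives. A quick sketch on a hypothetical three-value series (note that expanding casts the values to float internally):

```python
import pandas as pd

s = pd.Series([72, 28, 47])
windows = []
# each call receives the current window as a Series; the 0 return value is a dummy
s.expanding().apply(lambda x: windows.append(x.tolist()) or 0)
# windows -> [[72.0], [72.0, 28.0], [72.0, 28.0, 47.0]]
```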
Upvotes: 1
Reputation: 195418
You can use .expanding():
# convert date and sort dataframe:
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date")
df["freq_greater_than"] = (
    df["num"]
    .expanding(1)
    .apply(lambda x: ((x.iat[-1] < x[:-1]).sum() / (len(x) - 1)) * 100)
)
print(df)
Prints:
date num freq_greater_than
0 2020-01-01 72 NaN
1 2020-01-02 28 100.0
2 2020-01-03 47 50.0
3 2020-01-04 2 100.0
4 2020-01-05 73 0.0
5 2020-01-06 8 80.0
...
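The sort step matters because expanding walks rows in frame order, not date order. A quick sketch on hypothetical out-of-order dates:

```python
import pandas as pd

# toy frame with dates deliberately out of order
df = pd.DataFrame({"date": ["2020-01-03", "2020-01-01", "2020-01-02"],
                   "num": [47, 72, 28]})
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date")  # without this, "past" windows would mix in future dates

pct = df["num"].expanding(1).apply(
    lambda x: ((x.iat[-1] < x[:-1]).sum() / (len(x) - 1)) * 100
)
# sorted order is 72, 28, 47 -> NaN, 100.0, 50.0
```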
Upvotes: 1