Olshansky
Olshansky

Reputation: 6404

Pandas DataFrame - Compute frequency of number w/ a greater value before current date

I'm trying to count the frequency of the number of occurrences a value has been higher than today's value for all dates in the past.

I've looked into using pandas rolling and cumcount functions, but still cannot figure out how to do this efficiently.

Below is a working code snippet of what I'm trying to achieve with the output displayed in the image following. My gut says that this is an extremely ugly and inefficient approach, so I was hoping to find something better.

    def freq_greater_than(r):
        smaller_date = df[df.date < r.date]
        larger_num = smaller_date[(smaller_date.num > r.num)]
        return round(len(larger_num) / len(smaller_date) * 100, 2)
    
    index = pd.date_range('2020-01-01', '2020-01-10')
    df = pd.DataFrame(dict(num=np.random.random_integers(0, 100, len(index))), index=pd.Series(index, name='date'))
    df['date'] = df.index
    df['freq_greater_than'] = df.iloc[1:].apply(freq_greater_than, axis=1)
    df.style.format({'freq_greater_than': '{:.2f}%'})

enter image description here

Upvotes: 1

Views: 109

Answers (3)

Henry Ecker
Henry Ecker

Reputation: 35626

Try with expanding apply + mean over gt comparison:

df['freq_greater_than'] = (
    df['num'].expanding()
        .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).mean() * 100)
)

df:

            num       date  freq_greater_than
date                                         
2020-01-01   72 2020-01-01                NaN
2020-01-02   28 2020-01-02         100.000000
2020-01-03   47 2020-01-03          50.000000
2020-01-04    2 2020-01-04         100.000000
2020-01-05   73 2020-01-05           0.000000
2020-01-06    8 2020-01-06          80.000000
2020-01-07   87 2020-01-07           0.000000
2020-01-08   94 2020-01-08           0.000000
2020-01-09   40 2020-01-09          62.500000
2020-01-10   25 2020-01-10          77.777778

Optional map for string formatting:

df['freq_greater_than'] = (
    df['num'].expanding()
        .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).mean() * 100)
        .map('{:.2f}%'.format)
)
            num       date freq_greater_than
date                                        
2020-01-01   72 2020-01-01              nan%
2020-01-02   28 2020-01-02           100.00%
2020-01-03   47 2020-01-03            50.00%
2020-01-04    2 2020-01-04           100.00%
2020-01-05   73 2020-01-05             0.00%
2020-01-06    8 2020-01-06            80.00%
2020-01-07   87 2020-01-07             0.00%
2020-01-08   94 2020-01-08             0.00%
2020-01-09   40 2020-01-09            62.50%
2020-01-10   25 2020-01-10            77.78%

DataFrame Constructor and Imports:

import pandas as pd

index = pd.date_range('2020-01-01', '2020-01-10')
df = pd.DataFrame({'num': [72, 28, 47, 2, 73, 8, 87, 94, 40, 25]},
                  index=pd.Series(index, name='date'))
df['date'] = df.index

Upvotes: 1

Mustafa Aydın
Mustafa Aydın

Reputation: 18296

Here is one way with expanding:

df["freq_gt"] = (df.num
                   .expanding()
                   .agg(lambda x: 100 * x.iloc[:-1].gt(x.iat[-1]).sum() / (x.size - 1)))

where expanding sends ever-increasing windows over num column and x represents the windows. Then we look at the values except for the last one via iloc[:-1] and compare it (gt) with the last one (iat[-1]). Then sum counts the number of Trues resulted from comparison which gives the count of those that are greater. Dividing it by the window size minus 1 to exclude the current value and 100 * gives the ratio,

to get

            num     freq_gt
date
2020-01-01   46         NaN
2020-01-02   35  100.000000
2020-01-03   80    0.000000
2020-01-04    4  100.000000
2020-01-05   43   50.000000
2020-01-06    4   80.000000
2020-01-07    7   66.666667
2020-01-08   91    0.000000
2020-01-09   95    0.000000
2020-01-10   99    0.000000

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195418

You can use .expanding():

# convert date and sort dataframe:
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date")

df["freq_greater_than"] = (
    df["num"]
    .expanding(1)
    .apply(lambda x: ((x.iat[-1] < x[:-1]).sum() / (len(x) - 1)) * 100)
)
print(df)

Prints:

        date  num  freq_greater_than
0 2020-01-01   72                NaN
1 2020-01-02   28              100.0
2 2020-01-03   47               50.0
3 2020-01-04    2              100.0
4 2020-01-05   73                0.0
5 2020-01-06    8               80.0

...

Upvotes: 1

Related Questions