RandomGuy42
RandomGuy42

Reputation: 321

Get value count over time in pandas with rolling window

I am trying to get the "value popularity" over time in a Pandas Dataframe.

Key RandomColumn
a   0
b   ...
c   does
d   not
a   matter
e   ...
c   ...

I want to use a lets, say, rolling window of 2 and count the number of times "this" value (from the row in question, has been seen).

Output:

Key RandomColumn Popularity
a   0            0
b   ...          0
c   does         0
d   not          0
a   matter       0  <--- outside past window of 2
d   ...          1
c   ...          0  <-- outside window too
c   ....         1
c   ....         2

I have been trying things like groupby but then I lose the sliding, I can get the sliding with rolling but then not so sure how to aggregate and return.

Last attempt I endup with something like:

df['popularity']=df.groupby('key').cumcount().rolling(2, min_periods=1).sum()

But that is clearly wrong.

I can always use a for-loop, or maybe apply with indexing and manual count, but wondering if there is a "proper" way to do it in Pandas.

Update: Example #2: "Count the number of times a given value X has been seen in the past N-records", for simplicity, lets say N=2

For the Series: [a, b, b, c, d, a, b, c, c, c, a, b]
The output is:  [0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0]

First 2 zeros (0, 0) as, a and b, has never seen before, then (1) b has seen 1 time in the past 2 values and so on.

Upvotes: 2

Views: 2217

Answers (2)

Valentino
Valentino

Reputation: 7361

Apparently you cannot use rolling with non numeric objects. See for example this answer.
So for a general solution where you can choose your N, the for loop is probably the best approach.

You can define a function such as:

def popularitycount(sr, n):
    """sr is an iterable (a list, tuple, or pandas Series), n an integer"""
    previous = []
    res = []
    for el in sr:
        res.append(previous.count(el))
        previous.append(el)
        if len(previous) > n:
            previous.pop(0)
    return pd.Series(res)

Using your input, this returns the following Series:

0     0
1     0
2     1
3     0
4     0
5     0
6     0
7     0
8     1
9     2
10    0
11    0
dtype: int64

Upvotes: 1

Erfan
Erfan

Reputation: 42886

Using shift(), rolling & np.where:

  1. First we mark all the rows where the previous row is equal to the current one with .shift
  2. We calculate the rolling sum with .rolling
  3. We conditionally fill in 0 if the previous row is not equal to current one to correct some rows, since rolling(2) drags on 1 row too far.
m1 = df.eq(df.shift())
m2 = df.eq(df.shift()).rolling(2).sum()

df['Popularity'] = np.where(m1, m2, 0).astype(int)

Output

   Key  Popularity
0    a           0
1    b           0
2    b           1
3    c           0
4    d           0
5    a           0
6    b           0
7    c           0
8    c           1
9    c           2
10   a           0
11   b           0

Input dataframe used:

   Key
0    a
1    b
2    b
3    c
4    d
5    a
6    b
7    c
8    c
9    c
10   a
11   b

Upvotes: 1

Related Questions