Reputation: 321
I am trying to get the "value popularity" over time in a Pandas Dataframe.
Key RandomColumn
a 0
b ...
c does
d not
a matter
e ...
c ...
I want to use a lets, say, rolling window of 2 and count the number of times "this" value (from the row in question, has been seen).
Output:
Key RandomColumn Popularity
a 0 0
b ... 0
c does 0
d not 0
a matter 0 <--- outside past window of 2
d ... 1
c ... 0 <-- outside window too
c .... 1
c .... 2
I have been trying things like groupby but then I lose the sliding, I can get the sliding with rolling but then not so sure how to aggregate and return.
Last attempt I endup with something like:
df['popularity']=df.groupby('key').cumcount().rolling(2, min_periods=1).sum()
But that is clearly wrong.
I can always use a for-loop, or maybe apply with indexing and manual count, but wondering if there is a "proper" way to do it in Pandas.
Update: Example #2: "Count the number of times a given value X has been seen in the past N-records", for simplicity, lets say N=2
For the Series: [a, b, b, c, d, a, b, c, c, c, a, b]
The output is: [0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0]
First 2 zeros (0, 0) as, a and b, has never seen before, then (1) b has seen 1 time in the past 2 values and so on.
Upvotes: 2
Views: 2217
Reputation: 7361
Apparently you cannot use rolling
with non numeric objects. See for example this answer.
So for a general solution where you can choose your N
, the for
loop is probably the best approach.
You can define a function such as:
def popularitycount(sr, n):
"""sr is an iterable (a list, tuple, or pandas Series), n an integer"""
previous = []
res = []
for el in sr:
res.append(previous.count(el))
previous.append(el)
if len(previous) > n:
previous.pop(0)
return pd.Series(res)
Using your input, this returns the following Series
:
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 2
10 0
11 0
dtype: int64
Upvotes: 1
Reputation: 42886
Using shift()
, rolling
& np.where
:
.shift
.rolling
rolling(2)
drags on 1 row too far.m1 = df.eq(df.shift())
m2 = df.eq(df.shift()).rolling(2).sum()
df['Popularity'] = np.where(m1, m2, 0).astype(int)
Output
Key Popularity
0 a 0
1 b 0
2 b 1
3 c 0
4 d 0
5 a 0
6 b 0
7 c 0
8 c 1
9 c 2
10 a 0
11 b 0
Input dataframe used:
Key
0 a
1 b
2 b
3 c
4 d
5 a
6 b
7 c
8 c
9 c
10 a
11 b
Upvotes: 1