Reputation: 21
I'd like to take the following data, and check for each day whether the animal was observed the previous day, then create a count per day of new animals observed.
import pandas as pd
data = {'Date': pd.to_datetime(['18/08/2018', '18/08/2018', '18/08/2018',
                                '19/08/2018', '19/08/2018', '19/08/2018',
                                '19/08/2018', '19/08/2018', '20/08/2018',
                                '20/08/2018', '20/08/2018'],
                               dayfirst=True),  # dates are day/month/year
        'Animal': ['cat', 'dog', 'mouse', 'cat', 'dog', 'mouse', 'rabbit', 'rat', 'lion', 'tiger', 'monkey']
        }
df = pd.DataFrame(data)
With a result something like:
1. 18/08/2018 3
2. 19/08/2018 2
3. 20/08/2018 3
I'm very new to Python, so any help is much appreciated! Thanks.
Upvotes: 2
Views: 551
Reputation: 75080
Here is another approach: aggregate each day's animals into a set, shift the result by one row, and count the set difference between each day and the previous one.
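# collect each day's animals into a set
m = df.groupby('Date')['Animal'].agg(set)
# start from the number of distinct animals per day (already correct for the first day)
n = m.str.len()
# for later days, count the animals that are not in the previous day's set
n.iloc[1:] = [len(a.difference(b)) for a,b in zip(m,m.shift().fillna(m.head(1)))][1:]
print(n)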
Date
2018-08-18 3
2018-08-19 2
2018-08-20 3
dtype: int64
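If the one-liner is hard to follow, the same idea can be spelled out as an explicit loop. This is only a sketch, assuming the df from the question; the names sets_per_day, previous and new_counts are mine, and "previous day" here means the previous date that appears in the data, not necessarily the previous calendar day.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['18/08/2018'] * 3 + ['19/08/2018'] * 5 + ['20/08/2018'] * 3,
                           dayfirst=True),
    'Animal': ['cat', 'dog', 'mouse', 'cat', 'dog', 'mouse',
               'rabbit', 'rat', 'lion', 'tiger', 'monkey'],
})

# One set of animals per date; groupby sorts the dates by default.
sets_per_day = df.groupby('Date')['Animal'].agg(set)

new_counts = {}
previous = set()  # nothing was observed before the first day
for date, animals in sets_per_day.items():
    # Animals seen today that were not seen on the previous observed date.
    new_counts[date] = len(animals - previous)
    previous = animals

new_counts = pd.Series(new_counts, name='new_animals')
new_counts.index.name = 'Date'
print(new_counts)
```

Starting from an empty set means the first day needs no special handling: every animal seen that day counts as new.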
Upvotes: 3
Reputation: 88236
Here's one approach using pd.factorize:
s = (pd.Series(pd.factorize(df.Animal)[0]).groupby(df.Date).max()+1)
# decumulate and fill first row
s.diff().fillna(s)
Date
2018-08-18 3.0
2018-08-19 2.0
2018-08-20 3.0
dtype: float64
Factorizing encodes each distinct animal as an integer code, numbered in order of first appearance:
pd.factorize(df.Animal)[0]
# array([0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7], dtype=int64)
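To make the encoding more concrete, here is a small sketch showing both values that pd.factorize returns: the integer codes and the array of unique values they index into. The variable names animals, codes and uniques are just for illustration.

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'mouse', 'cat', 'dog', 'mouse',
                     'rabbit', 'rat', 'lion', 'tiger', 'monkey'])

# codes[i] is the position of animals[i] in uniques;
# uniques lists each distinct value in order of first appearance.
codes, uniques = pd.factorize(animals)

print(codes.tolist())  # [0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7]
print(list(uniques))   # ['cat', 'dog', 'mouse', 'rabbit', 'rat', 'lion', 'tiger', 'monkey']
```

Because codes are handed out in order of first appearance, the largest code seen so far, plus one, equals the number of distinct animals seen so far.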
Grouping by Date and taking the max (plus 1, since the codes start at 0) gives the accumulated number of distinct animals seen up to each day:
Date
2018-08-18 3
2018-08-19 5
2018-08-20 8
dtype: int64
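As a cross-check, the same running totals can be reached without factorize by keeping only each animal's first sighting and accumulating the per-day counts. This is a sketch under the assumption that the rows are in date order, as in the question; first_seen, all_days, new_per_day and cumulative are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['18/08/2018'] * 3 + ['19/08/2018'] * 5 + ['20/08/2018'] * 3,
                           dayfirst=True),
    'Animal': ['cat', 'dog', 'mouse', 'cat', 'dog', 'mouse',
               'rabbit', 'rat', 'lion', 'tiger', 'monkey'],
})

# Keep only the first row in which each animal appears.
first_seen = df.drop_duplicates(subset='Animal')

# Count first sightings per day; days with no first sighting get 0.
all_days = pd.Index(df['Date'].unique())
new_per_day = first_seen.groupby('Date').size().reindex(all_days, fill_value=0)

# Running total of distinct animals seen so far.
cumulative = new_per_day.cumsum()
print(cumulative)
```

For the sample data this gives the same 3, 5, 8 as the grouped max of the factorized codes.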
Now we can take the diff to decumulate the Series, filling the first row (which has no previous value) with its own total, as in s.diff().fillna(s) above.
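Putting the steps together, a runnable sketch could look like the following. Two things here are my additions rather than part of the approach as posted: the cummax() call, which keeps the running total from dropping on a day where no new animal shows up, and astype(int), which avoids the float64 result that diff() produces.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['18/08/2018'] * 3 + ['19/08/2018'] * 5 + ['20/08/2018'] * 3,
                           dayfirst=True),
    'Animal': ['cat', 'dog', 'mouse', 'cat', 'dog', 'mouse',
               'rabbit', 'rat', 'lion', 'tiger', 'monkey'],
})

# Integer code per animal, in order of first appearance.
codes = pd.Series(pd.factorize(df['Animal'])[0], index=df.index)

# Highest code seen on each day, plus 1 because codes start at 0.
s = codes.groupby(df['Date']).max() + 1

# Added guard (assumption): a day with only already-known animals
# keeps the previous running total instead of a smaller value.
s = s.cummax()

# Decumulate; the first row has no predecessor, so it keeps its own total.
new_per_day = s.diff().fillna(s).astype(int)
print(new_per_day)
```

Note that this counts animals never seen on any earlier day, whereas comparing only against the previous day would count an animal again if it came back after a day's absence. For the sample data both readings give 3, 2, 3.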
Upvotes: 3