Reputation: 6132
So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
I need to count how many different IDs there are by group in the last year. So my desired output would look like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I tried many things along the lines of:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
But the last one isn't even possible. Is there any straightforward way to do this? I'm thinking maybe some kind of combination of cumcount() and pd.DateOffset, or maybe ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.
Edit: added the fact that I can find more than one ID in a given Period.
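For illustration, this is the brute-force shape of what I have in mind (a rough, untested per-row sketch on the sample data above; ids_last_year is just a throwaway helper name), even though it rescans the frame for every row:
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Period': pd.to_datetime(
        ['20130101', '20130101', '20130301', '20140101', '20140301', '20140401',
         '20130101', '20130201', '20130401', '20140401', '20140501'],
        format='%Y%m%d'),
    'group': ['A'] * 6 + ['B'] * 5,
    'ID': [10, 20, 20, 20, 30, 40, 11, 21, 31, 41, 51],
})

# For each row, look back one year (inclusive) within the same group and
# count unique IDs. Quadratic, and it keeps duplicate (Period, group) rows
# as-is, so it is only meant to show the idea.
def ids_last_year(row):
    lower = row['Period'] - pd.DateOffset(years=1)
    mask = ((df['group'] == row['group'])
            & df['Period'].ge(lower)
            & df['Period'].le(row['Period']))
    return df.loc[mask, 'ID'].nunique()

df['num_ids_last_year'] = df.apply(ids_last_year, axis=1)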
Upvotes: 4
Views: 2062
Reputation: 25239
Here is a solution using groupby and rolling. Note: your desired output counts a year inclusively, from YYYY0101 through the next year's YYYY0101, so you need a rolling window of '366D' instead of '365D':
import numpy as np
import pandas as pd

df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
df = df.set_index('Period')

df_final = (df.groupby('group')['ID'].rolling(window='366D')
              .apply(lambda x: np.unique(x).size, raw=True)   # distinct IDs inside each window
              .reset_index(name='ID_count')
              .drop_duplicates(['group', 'Period'], keep='last'))  # keep the full count per Period
Out[218]:
group Period ID_count
1 A 2013-01-01 2.0
2 A 2013-03-01 2.0
3 A 2014-01-01 2.0
4 A 2014-03-01 2.0
5 A 2014-04-01 3.0
6 B 2013-01-01 1.0
7 B 2013-02-01 2.0
8 B 2013-04-01 3.0
9 B 2014-04-01 2.0
10 B 2014-05-01 2.0
Note: on 18M+ rows, I don't think this solution will finish within 10 minutes; I'd expect it to take about 30 minutes.
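One cheap pre-step for data that size (my suggestion, assuming exact duplicate rows carry no extra information): deduplicate before the set_index step above, since repeated (Period, group, ID) rows can never change a window's unique count.
df = df.drop_duplicates(subset=['Period', 'group', 'ID'])  # safe: nunique per window is unchanged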
Upvotes: 1
Reputation: 17122
Looking at your data structure, I am guessing you have MANY duplicates, so start with dropping them; drop_duplicates tends to be fast. I am assuming that the df['Period'] column is of dtype datetime64[ns].
import pandas as pd
from dateutil.relativedelta import relativedelta

df = df.drop_duplicates()
results = dict()
for start in df['Period'].drop_duplicates():
    end = start - relativedelta(years=1)
    screen = (df.Period <= start) & (df.Period >= end)  # screen for 1 year of data
    singles = df.loc[screen, ['group', 'ID']].drop_duplicates()  # one row per (group, ID) in that year
    x = singles.groupby('group').count()
    results[start] = x

results = pd.concat(results, axis=0)
results
               ID
      group
2013-01-01 A    2
           B    1
2013-02-01 A    2
           B    2
2013-03-01 A    2
           B    2
2013-04-01 A    2
           B    3
2014-01-01 A    2
           B    3
2014-03-01 A    2
           B    1
2014-04-01 A    3
           B    2
2014-05-01 A    3
           B    2
Is that any faster?
P.S. if df['Period'] is not a datetime:
df['Period'] = pd.to_datetime(df['Period'], format='%Y%m%d', errors='ignore')
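If you want it in the same flat shape as the desired output in the question (a small follow-up of mine; out is just an illustrative name, and results is the concatenated frame from above), you can name the index levels and reset them:
out = (results
       .rename_axis(index=['Period', 'group'])       # name both index levels
       .rename(columns={'ID': 'num_ids_last_year'})
       .reset_index())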
Upvotes: 3
Reputation: 2775
from dateutil.relativedelta import relativedelta
import pandas as pd

df.sort_values(by=['Period'], inplace=True)  # if not already sorted

# create new output df: one row per (Period, group), with the IDs collected into a list
df1 = (df.groupby(['Period', 'group'])['ID']
         .apply(list)
         .reset_index())

# for each row, flatten the ID lists of the same group within the preceding
# year and count the distinct values
df1['num_ids_last_year'] = df1.apply(
    lambda x: len(set(df1.loc[(df1['Period'] >= x['Period'] - relativedelta(years=1))
                              & (df1['Period'] <= x['Period'])
                              & (df1['group'] == x['group'])]
                      .ID.apply(pd.Series).stack())),
    axis=1)

df1.sort_values(by=['group'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)
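A note on the design (mine, not the answerer's): the row-wise apply rescans df1 once per row, so this is quadratic in the number of (Period, group) rows. It is fine at the scale of the example, but for the 18M-row case mentioned above the rolling-based answer should scale better.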
Upvotes: 0