Reputation: 10011
Given a dataframe constructed as follows:
import pandas as pd
import datetime
df = pd.DataFrame([[2, 3],[2, 1],[2, 1],[3, 4],[3, 1],[3, 1],[3, 1],[3, 1],[4, 2],[4, 1],[4, 1],[4, 1]], columns=['id', 'count'])
df['date'] = [datetime.datetime.strptime(x,'%Y-%m-%d %H:%M:%S') for x in
['2016-12-28 15:17:00','2016-12-28 15:29:00','2017-01-05 09:32:00','2016-12-03 18:10:00','2016-12-10 11:31:00',
'2016-12-14 09:32:00','2016-12-18 09:31:00','2016-12-22 09:32:00','2016-11-28 15:31:00','2016-12-01 16:11:00',
'2016-12-10 09:31:00','2016-12-13 12:06:00']]
I would like to group rows that share the same id: if the date difference between consecutive rows of that id is less than 4 days, they belong to the same group; otherwise a new group starts. The group label goes into a new column new_id, on which I then groupby and sum count. For example, within id 2, 2017-01-05 is more than 4 days apart from 2016-12-28, so it gets its own new_id (2-1) while the two 2016-12-28 rows share another (2-2).
I have got the result with the following code, but it's too slow. How could I make it more efficient?
# sort so that, within each id, dates run from newest to oldest
df.sort_values(by=['id', 'date'], ascending=[True, False], inplace=True)
df['id'] = df['id'].astype(str)
# neighbouring rows for comparison
df['id_up'] = df['id'].shift(-1)
df['id_down'] = df['id'].shift(1)
df['date_up'] = df['date'].shift(-1)
# gap (in days) to the next row of the same id, else 0
df['date_diff'] = df.apply(lambda row: (row['date'] - row['date_up']) / datetime.timedelta(days=1) if row['id'] == row['id_up'] else 0, axis=1)
df = df.reset_index()
df = df.drop(['index', 'id_up', 'id_down', 'date_up'], axis=1)
df['new'] = ''
for i in range(df.shape[0]):
    if i == 0:
        df.loc[i, 'new'] = 1
    else:
        if df.loc[i, 'id'] != df.loc[i - 1, 'id']:
            df.loc[i, 'new'] = 1
        else:
            if df.loc[i - 1, 'date_diff'] <= 4:
                df.loc[i, 'new'] = df.loc[i - 1, 'new']
            else:
                df.loc[i, 'new'] = df.loc[i - 1, 'new'] + 1
df['new'] = df['id'].astype(str) + '-' + df['new'].astype(str)
df1 = df.groupby('new')['date'].min()
df1 = df1.reset_index()
df1.rename(columns={"date": "first_date"}, inplace=True)
df = pd.merge(df, df1, on='new')
df1 = df.groupby('new')['date'].max()
df1 = df1.reset_index()
df1.rename(columns={"date": "last_date"}, inplace=True)
df = pd.merge(df, df1, on='new')
df1 = df.groupby('new')['count'].sum()
df1 = df1.reset_index()
df1.rename(columns={"count": "count_sum"}, inplace=True)
df = pd.merge(df, df1, on='new')
print(df)
Out:
id count date date_diff new first_date last_date count_sum
0 2 1 2017-01-05 09:32:00 7.752083 2-1 2017-01-05 09:32:00 2017-01-05 09:32:00 1
1 2 1 2016-12-28 15:29:00 0.008333 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
2 2 3 2016-12-28 15:17:00 0.000000 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
3 3 1 2016-12-22 09:32:00 4.000694 3-1 2016-12-22 09:32:00 2016-12-22 09:32:00 1
4 3 1 2016-12-18 09:31:00 3.999306 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
5 3 1 2016-12-14 09:32:00 3.917361 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
6 3 1 2016-12-10 11:31:00 6.722917 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
7 3 4 2016-12-03 18:10:00 0.000000 3-3 2016-12-03 18:10:00 2016-12-03 18:10:00 4
8 4 1 2016-12-13 12:06:00 3.107639 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
9 4 1 2016-12-10 09:31:00 8.722222 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
10 4 1 2016-12-01 16:11:00 3.027778 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
11 4 2 2016-11-28 15:31:00 0.000000 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
Upvotes: 1
Views: 130
Reputation: 10011
Another solution:
from datetime import timedelta  # used for the day conversions below

df.sort_values(by=['id', 'date'], ascending=[True, False], inplace=True)
interval_date = 4
groups = df.groupby('id')
# interval_date = pd.to_timedelta(4, unit='D')
df['date_diff_down'] = groups.date.diff(-1).abs()/timedelta(days=1)
df = df.fillna(method='ffill')
df['date_diff_up'] = groups.date.diff(1).abs()/timedelta(days=1)
df = df.fillna(method='bfill')
df['data_chunk_mark'] = df.apply(lambda row: 0 if row['date_diff_up'] < interval_date else 1, axis=1)
groups = df.groupby('id')
df['new_id'] = groups['data_chunk_mark'].cumsum().astype(int) + 1
df['new_id'] = df['id'].astype(str) + '-' + df['new_id'].astype(str)
new_groups = df.groupby('new_id')
# df['first_date'] = new_groups.date.transform('min')
# df['last_date'] = new_groups.date.transform('max')
df['count_sum'] = new_groups['count'].transform('sum')
print(df)
Out:
id count date date_diff_down date_diff_up \
1 2 1 2017-01-05 09:32:00 7.752083 7.752083
2 2 1 2016-12-28 15:29:00 0.008333 7.752083
0 2 3 2016-12-28 15:17:00 0.008333 0.008333
7 3 1 2016-12-22 09:32:00 4.000694 4.000694
6 3 1 2016-12-18 09:31:00 3.999306 4.000694
5 3 1 2016-12-14 09:32:00 3.917361 3.999306
4 3 1 2016-12-10 11:31:00 6.722917 3.917361
3 3 4 2016-12-03 18:10:00 6.722917 6.722917
11 4 1 2016-12-13 12:06:00 3.107639 3.107639
10 4 1 2016-12-10 09:31:00 8.722222 3.107639
9 4 1 2016-12-01 16:11:00 3.027778 8.722222
8 4 2 2016-11-28 15:31:00 3.027778 3.027778
data_chunk_mark new_id count_sum
1 1 2-2 1
2 1 2-3 4
0 0 2-3 4
7 1 3-2 1
6 1 3-3 3
5 0 3-3 3
4 0 3-3 3
3 1 3-4 4
11 0 4-1 2
10 0 4-1 2
9 1 4-2 3
8 0 4-2 3
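If a collapsed one-row-per-group summary is wanted instead of the broadcast transform columns, here is a minimal sketch using named aggregation (assuming pandas 0.25+; the summary variable name is just illustrative):
# one row per new_id, with the group's date range and summed count
summary = df.groupby('new_id').agg(
    first_date=('date', 'min'),
    last_date=('date', 'max'),
    count_sum=('count', 'sum'),
).reset_index()
print(summary)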
Upvotes: 0
Reputation: 7211
In pandas, groupby can take a function that maps each row index to a group label; the function is called once per row. Using this, we can do the following:
# sort dataframe by id and date in ascending order
df = df.sort_values(["id", "date"]).reset_index(drop=True)

# global variables for convenience of demonstration
lastid = maxdate = None
groupid = 0

def grouper(rowidx):
    global lastid, maxdate, groupid
    row = df.loc[rowidx]
    if lastid != row['id'] or maxdate < row['date']:
        # start a new group
        lastid = row['id']
        maxdate = row['date'] + datetime.timedelta(days=4)
        groupid += 1
    return groupid

# use grouper to split df into groups
for id, group in df.groupby(grouper):
    print("[%s]" % id)
    print(group)
The output of the above using your df is:
[1]
id count date
0 2 3 2016-12-28 15:17:00
1 2 1 2016-12-28 15:29:00
[2]
id count date
2 2 1 2017-01-05 09:32:00
[3]
id count date
3 3 4 2016-12-03 18:10:00
[4]
id count date
4 3 1 2016-12-10 11:31:00
5 3 1 2016-12-14 09:32:00
[5]
id count date
6 3 1 2016-12-18 09:31:00
[6]
id count date
7 3 1 2016-12-22 09:32:00
[7]
id count date
8 4 2 2016-11-28 15:31:00
9 4 1 2016-12-01 16:11:00
[8]
id count date
10 4 1 2016-12-10 09:31:00
11 4 1 2016-12-13 12:06:00
and you can implement arbitrary group-by logic using this mechanism.
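For instance, a minimal sketch of collapsing these groups into the sums the question asks for; note that the module-level state must be reset before grouper is reused, and the named-aggregation form assumes pandas 0.25+:
# reset the globals so grouper starts counting from scratch
lastid = maxdate = None
groupid = 0
summary = df.groupby(grouper).agg(
    id=('id', 'first'),
    first_date=('date', 'min'),
    last_date=('date', 'max'),
    count_sum=('count', 'sum'),
)
print(summary)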
Upvotes: 1
Reputation: 150735
To get the new column, you can do something like this:
df.sort_values(by=['id', 'date'], ascending = [True, False], inplace = True)
groups = df.groupby('id')
# mask where the date differences exceed threshold
df['new'] = groups.date.diff().abs() > pd.to_timedelta(4, unit='D')
# group within each id
df['new'] = groups['new'].cumsum().astype(int) + 1
# concatenate `id` and `new`:
df['new'] = df['id'].astype(str) + '-' + df['new'].astype(str)
# get other columns with groupby
new_groups = df.groupby('new')
df['first_date'] = new_groups.date.transform('min')
df['last_date'] = new_groups.date.transform('max')
df['count_sum'] = new_groups['count'].transform('sum')
Output:
id count date new first_date last_date count_sum
-- ---- ------- ------------------- ----- ------------------- ------------------- -----------
0 2 1 2017-01-05 09:32:00 2-1 2017-01-05 09:32:00 2017-01-05 09:32:00 1
1 2 1 2016-12-28 15:29:00 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
2 2 3 2016-12-28 15:17:00 2-2 2016-12-28 15:17:00 2016-12-28 15:29:00 4
3 3 1 2016-12-22 09:32:00 3-1 2016-12-22 09:32:00 2016-12-22 09:32:00 1
4 3 1 2016-12-18 09:31:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
5 3 1 2016-12-14 09:32:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
6 3 1 2016-12-10 11:31:00 3-2 2016-12-10 11:31:00 2016-12-18 09:31:00 3
7 3 4 2016-12-03 18:10:00 3-3 2016-12-03 18:10:00 2016-12-03 18:10:00 4
8 4 1 2016-12-13 12:06:00 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
9 4 1 2016-12-10 09:31:00 4-1 2016-12-10 09:31:00 2016-12-13 12:06:00 2
10 4 1 2016-12-01 16:11:00 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
11 4 2 2016-11-28 15:31:00 4-2 2016-11-28 15:31:00 2016-12-01 16:11:00 3
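One detail worth noting: diff() returns NaT for the first row of each id, and comparing NaT with a Timedelta evaluates to False, so the cumsum restarts at 0 for each id and the + 1 labels its first block 1. A quick check:
import pandas as pd
# NaT never compares greater than a concrete Timedelta
print(pd.NaT > pd.to_timedelta(4, unit='D'))  # False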
Upvotes: 1