Reputation: 33
I am trying to sort a pandas dataframe block-wise without changing the order within blocks.
The dataframe contains forum posts, timestamps, and thread names. I have already sorted the dataframe such that all posts belonging to the same thread are in the right order using df.sort_values(['thread', 'timestamp'], inplace=True)
. I now want to sort the data blocks belonging to the same thread according to the timestamp of the first post in each block. The order within the blocks should remain unchanged.
What I currently have:
post timestamp thread
0 this 2009/10/30 16:51 hello
1 be 2009/11/02 17:11 hello
2 some 2008/07/10 15:23 nice
3 text 2007/04/22 14:11 question
4 this 2007/04/24 11:03 question
5 be 2007/05/03 17:55 question
6 some 2004/09/01 09:32 game
7 text 2010/01/01 03:32 wheather
What I want:
post timestamp thread
6 some 2004/09/01 09:32 game
3 text 2007/04/22 14:11 question
4 this 2007/04/24 11:03 question
5 be 2007/05/03 17:55 question
2 some 2008/07/10 15:23 nice
0 this 2009/10/30 16:51 hello
1 be 2009/11/02 17:11 hello
7 text 2010/01/01 03:32 wheather
Is there a way to do this?
Upvotes: 3
Views: 1117
Reputation: 323226
Using sort_values
with drop_duplicates
get the min, then we using Categorical
cate=df.sort_values('timestamp').drop_duplicates('thread')
df.thread=pd.Categorical(df.thread,ordered=True,categories=cate.thread.tolist())
df=df.sort_values('thread')
df
post timestamp thread
6 some 2004-09-01 09:32:00 game
3 text 2007-04-22 14:11:00 question
4 this 2007-04-24 11:03:00 question
5 be 2007-05-03 17:55:00 question
2 some 2008-07-10 15:23:00 nice
0 this 2009-10-30 16:51:00 hello
1 be 2009-11-02 17:11:00 hello
7 text 2010-01-01 03:32:00 wheather
Upvotes: 4
Reputation: 153460
Let's try to first groupby
thread then get the first record, sort those records by time, then use the groups
attribute of DataFrameGroupBy to get the current order of index in each group. Lastly, use pd.concat
and list comprehension to rebuild dataframe in sorted order of first records.
g = df.groupby('thread')
s = g.head(1).sort_values('timestamp')['thread']
dg = g.groups
pd.concat([df.reindex(dg[i[1]]) for i in s.iteritems()])
Output:
post timestamp thread
6 some 2004-09-01 09:32:00 game
3 text 2007-04-22 14:11:00 question
4 this 2007-04-24 11:03:00 question
5 be 2007-05-03 17:55:00 question
2 some 2008-07-10 15:23:00 nice
0 this 2009-10-30 16:51:00 hello
1 be 2009-11-02 17:11:00 hello
7 text 2010-01-01 03:32:00 wheather
Upvotes: 4
Reputation: 402323
argsort
it. groupby
, taking advantage of the fact that groupby
sorts the groups by key, but does not change the order within groups. concat
the resultant groups in sorted order.idx = df['thread'].map(df.groupby('thread')['timestamp'].first().argsort())
idx
0 3
1 3
2 2
3 1
4 1
5 1
6 0
7 4
Name: thread, dtype: int64
pd.concat([g for _, g in df.groupby(idx)])
post timestamp thread
6 some 2004/09/01 09:32 game
3 text 2007/04/22 14:11 question
4 this 2007/04/24 11:03 question
5 is 2007/05/03 17:55 question
2 some 2008/07/10 15:23 nice
0 this 2009/10/30 16:51 hello
1 is 2009/11/02 17:11 hello
7 text 2010/01/01 03:32 wheather
Upvotes: 4
Reputation: 29635
one way is to create a temporary column for example named 'first_ts' with groupby
on 'thread' and transform
to get the min
(so first date) on the column 'timestamp' for each thread. Now you can sort_values
by this column and drop
the temporary column.
# you might need to convert timestamp to datetime
df.timestamp = pd.to_datetime(df.timestamp)
#create the column
df['first_ts'] = df.groupby('thread').timestamp.transform(min)
#sort and drop
df = df.sort_values(['first_ts']).drop('first_ts',axis=1)
and you get as expected
print(df)
post timestamp thread
6 some 2004-09-01 09:32:00 game
3 text 2007-04-22 14:11:00 question
4 this 2007-04-24 11:03:00 question
5 be 2007-05-03 17:55:00 question
2 some 2008-07-10 15:23:00 nice
0 this 2009-10-30 16:51:00 hello
1 be 2009-11-02 17:11:00 hello
7 text 2010-01-01 03:32:00 wheather
or if you don't want create the column, you can also use reindex
with the index of the sorted value of groupby
such as:
df = df.reindex(df.groupby('thread').timestamp.transform(min)
.sort_values().index)
Upvotes: 3