D. D. Dickens
D. D. Dickens

Reputation: 33

Sorting groups by the first value without changing the group order

I am trying to sort a pandas dataframe block-wise without changing the order within blocks.

The dataframe contains forum posts, timestamps, and thread names. I have already sorted the dataframe such that all posts belonging to the same thread are in the right order using df.sort_values(['thread', 'timestamp'], inplace=True). I now want to sort the data blocks belonging to the same thread according to the timestamp of the first post in each block. The order within the blocks should remain unchanged.

What I currently have:

    post   timestamp         thread
0   this   2009/10/30 16:51  hello   
1   be     2009/11/02 17:11  hello
2   some   2008/07/10 15:23  nice
3   text   2007/04/22 14:11  question
4   this   2007/04/24 11:03  question
5   be     2007/05/03 17:55  question
6   some   2004/09/01 09:32  game
7   text   2010/01/01 03:32  wheather

What I want:

    post   timestamp         thread
6   some   2004/09/01 09:32  game
3   text   2007/04/22 14:11  question
4   this   2007/04/24 11:03  question
5   be     2007/05/03 17:55  question
2   some   2008/07/10 15:23  nice
0   this   2009/10/30 16:51  hello   
1   be     2009/11/02 17:11  hello
7   text   2010/01/01 03:32  wheather

Is there a way to do this?

Upvotes: 3

Views: 1117

Answers (4)

BENY
BENY

Reputation: 323226

Using sort_values with drop_duplicates get the min, then we using Categorical

cate=df.sort_values('timestamp').drop_duplicates('thread')
df.thread=pd.Categorical(df.thread,ordered=True,categories=cate.thread.tolist())
df=df.sort_values('thread')
df
   post           timestamp    thread
6  some 2004-09-01 09:32:00      game
3  text 2007-04-22 14:11:00  question
4  this 2007-04-24 11:03:00  question
5    be 2007-05-03 17:55:00  question
2  some 2008-07-10 15:23:00      nice
0  this 2009-10-30 16:51:00     hello
1    be 2009-11-02 17:11:00     hello
7  text 2010-01-01 03:32:00  wheather

Upvotes: 4

Scott Boston
Scott Boston

Reputation: 153460

Let's try to first groupby thread then get the first record, sort those records by time, then use the groups attribute of DataFrameGroupBy to get the current order of index in each group. Lastly, use pd.concat and list comprehension to rebuild dataframe in sorted order of first records.

g = df.groupby('thread')
s = g.head(1).sort_values('timestamp')['thread']
dg = g.groups

pd.concat([df.reindex(dg[i[1]]) for i in s.iteritems()])

Output:

   post           timestamp    thread
6  some 2004-09-01 09:32:00      game
3  text 2007-04-22 14:11:00  question
4  this 2007-04-24 11:03:00  question
5    be 2007-05-03 17:55:00  question
2  some 2008-07-10 15:23:00      nice
0  this 2009-10-30 16:51:00     hello
1    be 2009-11-02 17:11:00     hello
7  text 2010-01-01 03:32:00  wheather

Upvotes: 4

cs95
cs95

Reputation: 402323

  1. First, get the first "timestamp" of each group and argsort it.
  2. Next, use groupby, taking advantage of the fact that groupby sorts the groups by key, but does not change the order within groups.
  3. Finally, concat the resultant groups in sorted order.

idx = df['thread'].map(df.groupby('thread')['timestamp'].first().argsort())
idx

0    3
1    3
2    2
3    1
4    1
5    1
6    0
7    4
Name: thread, dtype: int64

pd.concat([g for _, g in df.groupby(idx)])

   post         timestamp    thread
6  some  2004/09/01 09:32      game
3  text  2007/04/22 14:11  question
4  this  2007/04/24 11:03  question
5    is  2007/05/03 17:55  question
2  some  2008/07/10 15:23      nice
0  this  2009/10/30 16:51     hello
1    is  2009/11/02 17:11     hello
7  text  2010/01/01 03:32  wheather

Upvotes: 4

Ben.T
Ben.T

Reputation: 29635

one way is to create a temporary column for example named 'first_ts' with groupby on 'thread' and transform to get the min (so first date) on the column 'timestamp' for each thread. Now you can sort_values by this column and drop the temporary column.

# you might need to convert timestamp to datetime 
df.timestamp = pd.to_datetime(df.timestamp)
#create the column
df['first_ts'] = df.groupby('thread').timestamp.transform(min)
#sort and drop
df = df.sort_values(['first_ts']).drop('first_ts',axis=1)

and you get as expected

print(df)
   post           timestamp    thread
6  some 2004-09-01 09:32:00      game
3  text 2007-04-22 14:11:00  question
4  this 2007-04-24 11:03:00  question
5    be 2007-05-03 17:55:00  question
2  some 2008-07-10 15:23:00      nice
0  this 2009-10-30 16:51:00     hello
1    be 2009-11-02 17:11:00     hello
7  text 2010-01-01 03:32:00  wheather

or if you don't want create the column, you can also use reindex with the index of the sorted value of groupby such as:

df = df.reindex(df.groupby('thread').timestamp.transform(min)
                  .sort_values().index)

Upvotes: 3

Related Questions