Michael
Michael

Reputation: 7726

Groupwise sorting in pandas

I want to sort an array within the group boundaries defined in another array. The groups are not presorted in any way and need to remain unchanged after the sorting. In numpy terms it would look like this:

import numpy as np

def groupwise_sort(group_idx, a, reverse=False):
    sortidx = np.lexsort((-a if reverse else a, group_idx))
    # Reverse sorting back to into grouped order, but preserving groupwise sorting
    revidx = np.argsort(np.argsort(group_idx, kind='mergesort'), kind='mergesort')
    return a[sortidx][revidx]

group_idx =   np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a =           np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])
groupwise_sort(group_idx, a)
# >>>            array([1, 2, 3, 4, 5, 1, 7, 5, 9])
groupwise_sort(group_idx, a, reverse=True)
# >>>            array([3, 7, 1, 5, 4, 9, 2, 5, 1])

How can I do the same with pandas? I saw df.groupby() and df.sort_values(), though I couldn't find a straight forward way to achieve the same sorting. And a fast one, if possible.

Upvotes: 6

Views: 1497

Answers (2)

coffeinjunky
coffeinjunky

Reputation: 11514

Let us first set the stage:

import pandas as pd
import numpy as np

group_idx =   np.array([3, 2, 3, 2, 2, 1, 2, 1, 1])
a =           np.array([3, 2, 1, 7, 4, 5, 5, 9, 1])

df = pd.DataFrame({'group': group_idx, 'values': a})
df
#   group  values
#0      3       3
#1      2       2
#2      3       1
#3      2       7
#4      2       4
#5      1       5
#6      2       5
#7      1       9
#8      1       1

To get a dataframe sorted by group and values (within groups):

df.sort_values(["group", "values"])

#   group  values
#8      1       1
#5      1       5
#7      1       9
#1      2       2
#4      2       4
#6      2       5
#3      2       7
#2      3       1
#0      3       3

To sort the values in descending order, use ascending = False. To apply different orders to different columns, you can supply a list:

df.sort_values(["group", "values"], ascending = [True, False])

#   group  values
#7      1       9
#5      1       5
#8      1       1
#3      2       7
#6      2       5
#4      2       4
#1      2       2
#0      3       3
#2      3       1

Here, groups are sorted in ascending order, and the values within each group are sorted in descending order.

To only sort values for contiguous rows belonging to the same group, create a new group indicator:

(I keep this in here for reference since it might be helpful for others. I wrote this in an earlier version before the OP clarified his question in the comments.)

df['new_grp'] = (df.group.diff(1) != 0).astype('int').cumsum()
df
#   group  values  new_grp
#0      3       3        1
#1      2       2        2
#2      3       1        3
#3      2       7        4
#4      2       4        4
#5      1       5        5
#6      2       5        6
#7      1       9        7
#8      1       1        7

We can then easily sort with new_grp instead of group, leaving the original order of groups untouched.

Ordering within groups but keeping the group-specifing row-positions:

To sort the elements of each group but keep the group-specific positions in the dataframe, we need to keep track of the original row numbers. For instance, the following will do the trick:

# First, create an indicator for the original row-number:

df["ind"] = range(len(df))

# Now, sort the dataframe as before
df_sorted = df.sort_values(["group", "values"])

# sort the original row-numbers within each group
newindex = df.groupby("group").apply(lambda x: x.sort_values(["ind"]))["ind"].values

# assign the sorted row-numbers to the sorted dataframe
df_sorted["ind"] = newindex

# Sort based on the row-numbers:
sorted_asc = df_sorted.sort_values("ind")

# compare the resulting order of values with your desired output:
np.array(sorted_asc["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])

This is easier to test and profile when written up in a function, so let's do that:

def sort_my_frame(frame, groupcol = "group", valcol = "values", asc = True):

    frame["ind"] = range(len(frame))
    frame_sorted = frame.sort_values([groupcol, valcol], ascending = [True, asc])
    ind_sorted = frame.groupby(groupcol).apply(lambda x: x.sort_values(["ind"]))["ind"].values
    frame_sorted["ind"] = ind_sorted
    frame_sorted = frame_sorted.sort_values(["ind"])

    return(frame_sorted.drop(columns = "ind"))

np.array(sort_my_frame(df, "group", "values", asc = True)["values"])
# array([1, 2, 3, 4, 5, 1, 7, 5, 9])
np.array(sort_my_frame(df, "group", "values", asc = False)["values"])
# array([3, 7, 1, 5, 4, 9, 2, 5, 1])

Note that the latter results match your desired outcome.

I am sure this can be written up in a more succinct way. For instance, if the index of your dataframe is already ordered, you can use that one instead of the indicator ind I create (i.e., following @DJK's comment, we can use sort_index instead of sort_values and avoid assigning an additional column). In any case, the above highlights one possible solution and how to approach it. An alternative would be to use your numpy functions and wrap the output around a pd.DataFrame.

Upvotes: 5

floydn
floydn

Reputation: 1131

Pandas is built on top of numpy. Assuming a dataframe like so:

df
Out[21]: 
   group  values
0      3       3
1      2       2
2      3       1
3      2       7
4      2       4
5      1       5
6      2       5
7      1       9
8      1       1

Call your function.

groupwise_sort(df.group.values, df['values'].values)
Out[22]: array([1, 2, 3, 4, 5, 1, 7, 5, 9])

groupwise_sort(df.group.values, df['values'].values, reverse=True)
Out[23]: array([3, 7, 1, 5, 4, 9, 2, 5, 1])

Upvotes: 1

Related Questions