Boris Polonsky
Boris Polonsky

Reputation: 125

Sorting pandas.DataFrame in python sorted() function manner

Description

Long story short, I need a way to sort a DataFrame by a specific column, given a specific function which is analagous to usage of "key" parameter in python built-in sorted() function. Yet there's no such "key" parameter in pd.DataFrame.sort_value() function.

The approach used for now

I have to create a new column to store the "scores" of a specific row, and delete it in the end. The problem of this approach is that the necessity to generate a column name which does not exists in the DataFrame, and it could be more troublesome when it comes to sorting by multiple columns.

I wonder if there's a more suitable way for such purpose, in which there's no need to come up with a new column name, just like using a sorted() function and specifying parameter "key" in it.

Update: I changed my implementation by using a new object instead of generating a new string beyond those in the columns to avoid collision, as shown in the code below.

Code

Here goes the example code. In this sample the DataFrame is needed to be sort according to the length of the data in row "snippet". Please don't make additional assumptions on the type of the objects in each rows of the specific column. The only thing given is the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produce a value, which is used for comparison.

def sort_table_by_key(self, ascending=True, key=len):
    """
    Sort the table inplace.
    """
    # column_tmp = "".join(self._table.columns)
    column_tmp = object() # Create a new object to avoid column name collision.
    # Calculate the scores of the objects. 
    self._table[column_tmp] = self._table["snippet"].apply(key)
    self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
    del self._table[column_tmp]

Upvotes: 2

Views: 1054

Answers (1)

jezrael
jezrael

Reputation: 863166

Now this is not implemented, check github issue 3942.

I think you need argsort and then select by iloc:

df = pd.DataFrame({
    'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
    'B': list(range(9))
})
print (df)
        A  B
0  assdsd  0
1     sda  1
2    affd  2
3  asddsd  3
4     ffb  4
5     sdb  5
6      db  6
7      cf  7
8       d  8

def sort_table_by_length(column, ascending=True):
    if ascending:
        return df.iloc[df[column].str.len().argsort()]
    else:
        return df.iloc[df[column].str.len().argsort()[::-1]]

print (sort_table_by_length('A'))
        A  B
8       d  8
6      db  6
7      cf  7
1     sda  1
4     ffb  4
5     sdb  5
2    affd  2
0  assdsd  0
3  asddsd  3

print (sort_table_by_length('A', False))
        A  B
3  asddsd  3
0  assdsd  0
2    affd  2
5     sdb  5
4     ffb  4
1     sda  1
7      cf  7
6      db  6
8       d  8

How it working:

First get lengths to new Series:

print (df['A'].str.len())
0    6
1    3
2    4
3    6
4    3
5    3
6    2
7    2
8    1
Name: A, dtype: int64

Then get indices by sorted values by argmax, for descending ordering is used this solution:

print (df['A'].str.len().argsort())
0    8
1    6
2    7
3    1
4    4
5    5
6    2
7    0
8    3
Name: A, dtype: int64

Last change ordering by iloc:

print (df.iloc[df['A'].str.len().argsort()])
        A  B
8       d  8
6      db  6
7      cf  7
1     sda  1
4     ffb  4
5     sdb  5
2    affd  2
0  assdsd  0
3  asddsd  3

Upvotes: 1

Related Questions