Yannick

Reputation: 51

Sort pandas dataframe float column with custom key

I want to sort the following DataFrame on the b-column first and then the a-column.

a b
0 1.2
2 0.07076863960397785
1 0.07076863960397783
4 0.02
3 0.07076863960397784

The math.isclose() function should be used to compare the floats in the b-column. Therefore, I wrote a custom compare function and use the cmp_to_key function from functools. However, when sorting the data frame I get the following error:

TypeError: object of type 'functools.KeyWrapper' has no len()

Here's my full code:

import pandas as pd
from functools import cmp_to_key
from math import isclose
import numpy as np

my_list = [
[0, 1.2],
[2, 0.07076863960397785],
[1, 0.07076863960397783],
[4, 0.02],
[3, 0.07076863960397784]
]

df = pd.DataFrame(my_list,columns=['a','b'])

def compare(a,b):
  if isclose(a,b):
    return 0
  elif a-b<0:
    return -1
  else:
    return 1

df.sort_values(by=['b', 'a'], key=cmp_to_key(compare))

Now, I know the key in sort_values expects a series, so the key function should be vectorized. But I don't know how to accomplish this.
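For context, a minimal illustration (with made-up values) of the signature sort_values expects: the key callable is invoked once per column with the whole Series and must return something of the same length, not a scalar comparison result.

```python
import pandas as pd

df = pd.DataFrame({'b': [3.0, 1.0, 2.0]})

# The key receives the entire column as a Series and must return a
# Series/array of equal length; rank() is one such vectorized key.
out = df.sort_values(by='b', key=lambda s: s.rank())
print(out['b'].tolist())  # [1.0, 2.0, 3.0]
```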

This should be the final result:

a b
4 0.02
1 0.07076863960397783
2 0.07076863960397785
3 0.07076863960397784
0 1.2

Upvotes: 3

Views: 206

Answers (4)

U13-Forward

Reputation: 71610

You could try groupby, then apply with pd.Series.sort_values to sort within each group.

First, a solution that groups column b with a 0.05 threshold: after sorting by b, diff gives the gap between consecutive values, and cumsum over the thresholded gaps assigns a group label to each run of close floats.

threshold = 0.05

df = df.sort_values('b').reset_index(drop=True)
groups = df['b'].diff().ge(threshold).cumsum()

df.iloc[df.groupby(groups)['a'].apply(pd.Series.sort_values).reset_index(level=0).index]
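To see how the diff/cumsum step assigns group labels on an already-sorted series (values here are illustrative, not the question's exact data):

```python
import pandas as pd

threshold = 0.05
# illustrative values, already sorted ascending
s = pd.Series([0.02, 0.0707686396039778, 0.0707686396039779, 1.2])

# a gap >= threshold starts a new group; cumsum turns that into labels
groups = s.diff().ge(threshold).cumsum()
print(groups.tolist())  # [0, 1, 1, 2]
```

The two close floats get the same label (1), so they sort together as one group.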

Second, a solution that rounds the floats instead. Here round(6) rounds column b to 6 decimal places, so the three close values become equal.

Code:

df['b'] = df['b'].round(6)

df.groupby('b')['a'].apply(pd.Series.sort_values).reset_index(level=0)[['a', 'b']]

Output:

   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000

Upvotes: 1

e-motta

Reputation: 7540

Edit: see the comments by @mozway and @Eric Postpischil for caveats of both solutions.


  1. You can round the values in column 'b' before sorting in a copy of the dataframe, then reindex the original dataframe with the indices from the copy:
sorted = df.copy()
sorted["b"] = sorted["b"].round(2)
sorted = sorted.sort_values(["b", "a"])

df = df.reindex(sorted.index)
   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000
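One caveat worth keeping in mind with this approach (touched on in the comments): rounding can separate values that are close but happen to straddle a rounding boundary.

```python
import pandas as pd

# 0.1249 and 0.1251 differ by only 0.0002, yet round apart at 2 decimals
s = pd.Series([0.1249, 0.1251])
print(s.round(2).tolist())  # [0.12, 0.13]
```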

  2. If you want to use a key function with sort_values, it should return a pd.Series (or array) of the same length as the column:
import numpy as np
import pandas as pd

def custom_key(series: pd.Series):
    sorted_indices = series.argsort()
    grouped_series = np.zeros_like(series)
    current_group = 0

    for i in sorted_indices:
        if grouped_series[i] == 0:  # not assigned to a group yet
            current_group += 1
            grouped_series[i] = current_group
            # give every value close to series[i] the same group label
            grouped_series[np.isclose(series[i], series)] = current_group

    return grouped_series


df = df.sort_values(by=["b", "a"], key=custom_key)
   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000

Upvotes: 1

Cameron Riddell

Reputation: 13457

To find runs of values that are near each other without truncation, one could pre-sort the array, diff it, and apply some thresholding. This has a compounding side-effect if you have multiple adjacent float values that are each just sub-threshold.

To exemplify this point, how many groups should be in the following array?

threshold = 0.05
array = [1.0, 1.02, 1.04, 1.08, 1.5, 1.7]

Obviously 1.5 and 1.7 are outside of our threshold, so they each belong in their own sorting groups, and 1.0, 1.02, 1.04 should be grouped together because their differences are all below the threshold. But where does 1.08 belong? An adjacent search says that it should belong in the same group as 1.04, which may be counter-intuitive because the difference between 1.0 and 1.08 exceeds the threshold.
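The chaining can be made concrete with a sketch of the grouping step alone (not the full sort): under an adjacent-diff threshold, 1.08 lands in the same group as 1.0 even though their direct difference exceeds 0.05.

```python
import numpy as np

threshold = 0.05
array = np.array([1.0, 1.02, 1.04, 1.08, 1.5, 1.7])

# a new group starts wherever the gap to the previous value exceeds
# the threshold; sub-threshold gaps chain values into one group
gaps = np.diff(array, prepend=array[0])
groups = (gaps > threshold).cumsum()
print(groups.tolist())  # [0, 0, 0, 0, 1, 2]
```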

If the above constraint is fine, then the following code will work:

import pandas as pd
import numpy as np

my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]

df = pd.DataFrame(my_list,columns=['a','b'])

threshold = 1e-6 # floats smaller than threshold will be grouped together
grouped_floats = df.apply(lambda s: s.sort_values().diff().gt(threshold).cumsum())
print(
    df.reindex(grouped_floats.sort_values(by=['b', 'a']).index)
)
#    a         b
# 3  4  0.020000
# 2  1  0.070769
# 1  2  0.070769
# 4  3  0.070769
# 0  0  1.200000

Upvotes: 1

user24714692

Reputation: 4949

You can use sort_values() with np.argsort():

import pandas as pd
import numpy as np
from math import isclose
from functools import cmp_to_key


compare = lambda a, b: 0 if isclose(a, b) else (-1 if a < b else 1)


def _sort(df):
    sorted_indices = sorted(range(len(df)), key=cmp_to_key(lambda i, j: compare(df.at[i, 'b'], df.at[j, 'b'])))
    df = df.iloc[sorted_indices].reset_index(drop=True)
    df = df.sort_values(by=['b', 'a'], key=lambda col: col if col.name == 'a' else np.argsort(col))
    return df


my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]

df = pd.DataFrame(my_list, columns=['a', 'b'])

print(_sort(df))

Prints

   a         b
0  4  0.020000
3  3  0.070769
1  2  0.070769
2  1  0.070769
4  0  1.200000

Upvotes: 1
