Reputation: 51
I want to sort the following DataFrame on the b-column first and then the a-column.
| a | b |
|---|---|
| 0 | 1.2 |
| 2 | 0.07076863960397785 |
| 1 | 0.07076863960397783 |
| 4 | 0.02 |
| 3 | 0.07076863960397784 |
The math.isclose() function should be used to compare the floats in the b-column, so I wrote a custom compare function and used the cmp_to_key function from functools. However, when sorting the DataFrame I get the following error:
TypeError: object of type 'functools.KeyWrapper' has no len()
Here's my full code:
```python
import pandas as pd
from functools import cmp_to_key
from math import isclose
import numpy as np

my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]
df = pd.DataFrame(my_list, columns=['a', 'b'])

def compare(a, b):
    if isclose(a, b):
        return 0
    elif a - b < 0:
        return -1
    else:
        return 1

df.sort_values(by=['b', 'a'], key=cmp_to_key(compare))
```
Now, I know the key in sort_values expects a Series, so the key function should be vectorized, but I don't know how to accomplish this.
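To show what I mean by vectorized: a key that rounds the column does run, and it happens to give the order I want for this data, but round() is only an approximation of isclose(), which is why I'm asking:

```python
import pandas as pd

my_list = [[0, 1.2], [2, 0.07076863960397785], [1, 0.07076863960397783],
           [4, 0.02], [3, 0.07076863960397784]]
df = pd.DataFrame(my_list, columns=['a', 'b'])

# The key receives each sort column as a whole Series and must return
# something of the same length; rounding makes the near-equal floats compare equal.
out = df.sort_values(by=['b', 'a'], key=lambda s: s.round(6))
print(out['a'].tolist())  # [4, 1, 2, 3, 0]
```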
This should be the final result:
| a | b |
|---|---|
| 4 | 0.02 |
| 1 | 0.07076863960397783 |
| 2 | 0.07076863960397785 |
| 3 | 0.07076863960397784 |
| 0 | 1.2 |
Upvotes: 3
Views: 206
Reputation: 71610
Maybe try `groupby`, then use `apply` with `pd.Series.sort_values` to sort within the groups.

The first solution uses `groupby` to group column `b` with a 0.05 threshold: `cumsum` over the above-threshold gaps identifies the groups of close float values.
```python
threshold = 0.05
df = df.sort_values('b').reset_index(drop=True)
groups = df['b'].diff().ge(threshold).cumsum()
df.iloc[df.groupby(groups)['a'].apply(pd.Series.sort_values).reset_index(level=0).index]
```
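To see the grouping in action, here is a self-contained run of the snippet above (the frame is reconstructed from the question's data):

```python
import pandas as pd

my_list = [[0, 1.2], [2, 0.07076863960397785], [1, 0.07076863960397783],
           [4, 0.02], [3, 0.07076863960397784]]
df = pd.DataFrame(my_list, columns=['a', 'b'])

threshold = 0.05
df = df.sort_values('b').reset_index(drop=True)
# Gaps >= 0.05 between consecutive sorted values start a new group,
# so the three near-identical floats land in the same group.
groups = df['b'].diff().ge(threshold).cumsum()
print(groups.tolist())  # [0, 1, 1, 1, 2]

result = df.iloc[df.groupby(groups)['a'].apply(pd.Series.sort_values).reset_index(level=0).index]
print(result['a'].tolist())  # [4, 1, 2, 3, 0]
```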
The second solution rounds the floats: `round(6)` keeps six decimal places, so the near-equal values compare equal. Code:
```python
df['b'] = df['b'].round(6)
df.groupby('b')['a'].apply(pd.Series.sort_values).reset_index(level=0)[['a', 'b']]
```
Output:

```
   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000
```
Upvotes: 1
Reputation: 7540
Edit: see comments by @mozway and @Eric Postpischil for caveats of both solutions.
One option is to round column `'b'` before sorting in a copy of the dataframe, then reindex the original dataframe with the indices from the copy:

```python
sorted = df.copy()
sorted["b"] = sorted["b"].round(2)
sorted = sorted.sort_values(["b", "a"])
df = df.reindex(sorted.index)
```
```
   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000
```
If you want to pass a key function to `sort_values`, it should return a `pd.Series`:

```python
def custom_key(series: pd.Series):
    sorted_indices = series.argsort()
    grouped_series = np.zeros_like(series)
    current_group = 0
    for i in sorted_indices:
        if grouped_series[i] == 0:
            current_group += 1
            grouped_series[i] = current_group
            grouped_series[np.isclose(series[i], series)] = current_group
    return grouped_series

df = df.sort_values(by=["b", "a"], key=custom_key)
```
```
   a         b
3  4  0.020000
2  1  0.070769
1  2  0.070769
4  3  0.070769
0  0  1.200000
```
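For a self-contained sanity check of the group labels custom_key assigns to column b (the function and data are repeated from above):

```python
import numpy as np
import pandas as pd

def custom_key(series: pd.Series):
    sorted_indices = series.argsort()
    grouped_series = np.zeros_like(series)
    current_group = 0
    for i in sorted_indices:
        if grouped_series[i] == 0:
            current_group += 1
            grouped_series[i] = current_group
            grouped_series[np.isclose(series[i], series)] = current_group
    return grouped_series

my_list = [[0, 1.2], [2, 0.07076863960397785], [1, 0.07076863960397783],
           [4, 0.02], [3, 0.07076863960397784]]
df = pd.DataFrame(my_list, columns=['a', 'b'])

# Each float in b gets a group label; the three near-identical values share one.
labels = custom_key(df['b'])
print(labels)  # [3. 2. 2. 1. 2.]
```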
Upvotes: 1
Reputation: 13457
To find runs of values that are near each other without truncation, one could pre-sort the arrays, `diff` them, and apply some thresholding. This has a compounding side effect when multiple adjacent float values are each just below the threshold apart.

To exemplify this point, how many groups should there be in the following array?

```python
threshold = 0.05
array = [1.0, 1.02, 1.04, 1.08, 1.5, 1.7]
```

Obviously `1.5` and `1.7` are outside our threshold, so they each belong in their own sorting group, and `1.0, 1.02, 1.04` should be grouped together because their differences are all below the threshold. But where does `1.08` belong? An adjacent search says it belongs in the same group as `1.04`, which may be counter-intuitive because the difference between `1.0` and `1.08` exceeds the threshold.
If the above constraint is fine, then the following code will work:
```python
import pandas as pd
import numpy as np

my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]
df = pd.DataFrame(my_list, columns=['a', 'b'])

threshold = 1e-6  # values whose difference is below this are grouped together

grouped_floats = df.apply(lambda s: s.sort_values().diff().gt(threshold).cumsum())

print(
    df.reindex(grouped_floats.sort_values(by=['b', 'a']).index)
)
#    a         b
# 3  4  0.020000
# 2  1  0.070769
# 1  2  0.070769
# 4  3  0.070769
# 0  0  1.200000
```
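The chained grouping can be checked directly on the example array above; a minimal NumPy sketch using the same diff-and-threshold idea:

```python
import numpy as np

threshold = 0.05
array = np.array([1.0, 1.02, 1.04, 1.08, 1.5, 1.7])

# A new group starts whenever the gap to the previous sorted value exceeds
# the threshold; 1.08 chains onto 1.04 even though it is 0.08 away from 1.0.
groups = np.concatenate(([0], np.cumsum(np.diff(array) > threshold)))
print(groups.tolist())  # [0, 0, 0, 0, 1, 2]
```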
Upvotes: 1
Reputation: 4949
You can use `sort_values()` with `np.argsort()`:
```python
import pandas as pd
import numpy as np
from math import isclose
from functools import cmp_to_key

compare = lambda a, b: 0 if isclose(a, b) else (-1 if a < b else 1)

def _sort(df):
    sorted_indices = sorted(range(len(df)),
                            key=cmp_to_key(lambda i, j: compare(df.at[i, 'b'], df.at[j, 'b'])))
    df = df.iloc[sorted_indices].reset_index(drop=True)
    df = df.sort_values(by=['b', 'a'],
                        key=lambda col: col if col.name == 'a' else np.argsort(col))
    return df

my_list = [
    [0, 1.2],
    [2, 0.07076863960397785],
    [1, 0.07076863960397783],
    [4, 0.02],
    [3, 0.07076863960397784]
]
df = pd.DataFrame(my_list, columns=['a', 'b'])
print(_sort(df))
```
```
   a         b
0  4  0.020000
3  3  0.070769
1  2  0.070769
2  1  0.070769
4  0  1.200000
```
Upvotes: 1