t3chb0t
t3chb0t

Reputation: 18694

Sort underscore before upercase letters

I have a DataFrame with UPPERCASE column names from a database that I'd like to sort. Some of them are prefixed with an underscore _. The default sorting puts them at the end. Do I have to convert them to lowercase, then sort, then convert them back to uppercase or is there an easier way?

Here's some sample data. When the values of cl2 are lowercase then the order is correct. With uppercase not so much.

pd.DataFrame({
   "cl1": ["foo", "foo", "foo"], 
   "cl2": ["_FOO", "BAR", "BAZ"]}
).sort_values(["cl1", "cl2"])

Upvotes: 1

Views: 427

Answers (2)

JGFMK
JGFMK

Reputation: 8904

I had similar peculiar results.

For me, I had a list of tuples I was trying to sort on a compound key.

I ended up using replace to remove underscores in the sort itself as follows:

  dl = sorted(dl, key=lambda t: (t[0].replace('_',''),t[1]))

Otherwise I was getting odd results like:

  • SAPHIRE

coming before:

  • SAP_AG

It's bizarre how Python sort works.

I tried both of the results here in conjunction with here to come up with this.

Upvotes: 1

mozway
mozway

Reputation: 261914

If conversion to lowercase works for you, you can use it as a key:

pd.DataFrame({
   "cl1": ["foo", "foo", "foo"], 
   "cl2": ["_FOO", "BAR", "BAZ"]}
).sort_values(["cl1", "cl2"], key=lambda s: s.str.lower())

Other option, sort by the strings without _, then the presence of _:

(pd.DataFrame({
   "cl1": ["foo", "foo", "foo"], 
   "cl2": ["_FOO", "BAR", "BAZ"]})
   .sort_values(["cl1", "cl2"], key=lambda s: s.str.lstrip('_'))
   .sort_values(["cl1", "cl2"], key=lambda s: ~s.str.startswith('_'), kind='stable')
)

output:

   cl1   cl2
0  foo  _FOO
1  foo   BAR
2  foo   BAZ

Upvotes: 1

Related Questions