Spinlock

Reputation: 31

Sort a list in Python, ignore blanks and case

I have a list (of dictionary keys), which I need to sort. This is my list:

listToBeSorted = ["Right  Coronary Artery 2", "Right Coronary Artery 1", "RIght Coronary Artery 3"]

Obviously, the order in which I'd like to have these items sorted would be:

["Right Coronary Artery 1", "Right  Coronary Artery 2", "RIght Coronary Artery 3"]

So I need to find a way to sort, ignoring the double blanks in the first item, and the uppercase "I" in the last item.

I tried the following sorting mechanisms:

  1. Plain sorting

    sortedList = sorted(listToBeSorted)
    

    will produce:

    ['RIght Coronary Artery 3',
     'Right  Coronary Artery 2',
     'Right Coronary Artery 1']
    
  2. Sorting, ignoring case:

    sortedList = sorted(listToBeSorted, key=str.casefold)
    

    will produce:

    ['Right  Coronary Artery 2',
     'Right Coronary Artery 1',
     'RIght Coronary Artery 3']
    
  3. Sorting, eliminating all blanks

    sortedList = sorted(listToBeSorted, key=lambda x: ''.join(x.split()))
    

    will produce:

    ['RIght Coronary Artery 3',
     'Right Coronary Artery 1',
     'Right  Coronary Artery 2']
    

I cannot change the entries themselves, as I need them to access the items in a dictionary later.

I eventually converted each list entry into a tuple, added an uppercase version without blanks as a second element, and sorted the list by that second element:

sortedListWithTwin = []
    
# Add an uppercase "twin" without whitespaces
for item in listToBeSorted:
  sortString = (item.upper()).replace(" ","")
  sortedListWithTwin.append((item, sortString))
       
# Sort list by the new "twin"
sortedListWithTwin.sort(key = lambda x: x[1])
    
# Remove the twin
sortedList = []
for item in sortedListWithTwin:
  sortedList.append(item[0])

This will produce the desired order:

['Right Coronary Artery 1',
 'Right  Coronary Artery 2',
 'RIght Coronary Artery 3']

However, this solution seems very cumbersome and inefficient. What would be a better way to solve this?

Upvotes: 3

Views: 328

Answers (3)

Andj

Reputation: 1447

I'll give an alternative method, using PyICU (a Python wrapper for icu4c). ICU has quite a powerful and flexible Collator class to allow tailored sorting.

I will include two methods:

  1. Create a collator instance for the locale you wish to use, and set the collator's attributes.
  2. Create a collator instance using a BCP 47 language tag with appropriate U extension settings.

For the problem in the question, I would activate numeric collation and set the collation strength to secondary (case distinctions are tertiary, so a secondary strength gives a caseless sort). Setting alternate handling to shifted addresses the whitespace issue in the question.

Setting attributes on collator

import icu
listToBeSorted = ["Right  Coronary Artery 2", "Right Coronary Artery 1", "RIght Coronary Artery 3"]
collator = icu.Collator.createInstance(icu.Locale.getRoot())
collator.setAttribute(icu.UCollAttribute.NUMERIC_COLLATION, icu.UCollAttributeValue.ON)
collator.setStrength(icu.UCollAttributeValue.SECONDARY)
collator.setAttribute(icu.UCollAttribute.ALTERNATE_HANDLING, icu.UCollAttributeValue.SHIFTED)
sorted(listToBeSorted, key=collator.getSortKey)

Creating locale from BCP-47 language tag

import icu
listToBeSorted = ["Right  Coronary Artery 2", "Right Coronary Artery 1", "RIght Coronary Artery 3"]
lang = "en-AU-u-kn-true-ka-shifted-kv-space-ks-level2"
loc = icu.Locale.forLanguageTag(lang)
collator = icu.Collator.createInstance(loc)
sorted(listToBeSorted, key=collator.getSortKey)

Both will result in ['Right Coronary Artery 1', 'Right  Coronary Artery 2', 'RIght Coronary Artery 3'].

In the BCP-47 version, I have restricted the alternate handling (shifting) to whitespace only. Punctuation, symbols and currency symbols could also have been included.
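If PyICU is not available, the three behaviours described above (numeric ordering, caseless comparison, ignoring whitespace) can be roughly approximated with the standard library. This is only a sketch: `sort_key` is a hypothetical helper, the entries ending in 9 and 10 are made up to show the numeric ordering, and it does not do locale-aware collation the way ICU does.

```python
import re

def sort_key(text):
    # Hypothetical helper: fold case and drop all whitespace first
    compact = "".join(text.casefold().split())
    # Split out digit runs; with a capturing group, re.split keeps them
    # at the odd indices of the result
    parts = re.split(r"(\d+)", compact)
    # Compare digit runs as integers and everything else as text
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Made-up entries with multi-digit numbers
names = ["Right  Coronary Artery 10", "Right Coronary Artery 9", "RIght Coronary Artery 2"]
print(sorted(names, key=sort_key))
# ['RIght Coronary Artery 2', 'Right Coronary Artery 9', 'Right  Coronary Artery 10']
```

A plain string key would place "10" before "2" and "9"; comparing the digit runs as integers avoids that.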

Upvotes: 0

Jamiu S.

Reputation: 5741

Sort using a lambda as the key:

sortedList = sorted(listToBeSorted, key=lambda x: x.casefold().replace(" ", ""))
print(sortedList)

If you don't want to use replace for some reason, you could use a regex instead.
re.sub() replaces every match with an empty string, and \s+ matches one or more consecutive whitespace characters. casefold() is kept to ignore case.

import re

sortedList = sorted(listToBeSorted, key=lambda x: re.sub(r"\s+", "", x.casefold()))
print(sortedList)

Output:

['Right Coronary Artery 1',
'Right  Coronary Artery 2',
'RIght Coronary Artery 3']
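The two keys only behave differently when an entry contains whitespace other than a plain space, since `replace(" ", "")` removes spaces only while `\s+` also matches tabs and newlines. A small sketch of that difference, using made-up entries (the tab in the second one is an assumption for illustration, not data from the question):

```python
import re

# Hypothetical entries: the second one contains a tab instead of a space
entries = ["Right Coronary Artery 1", "Right\tCoronary Artery 2"]

# replace(" ", "") leaves the tab in the key, and a tab sorts before letters
by_replace = sorted(entries, key=lambda x: x.casefold().replace(" ", ""))

# \s+ strips the tab as well, so only the trailing number decides the order
by_regex = sorted(entries, key=lambda x: re.sub(r"\s+", "", x.casefold()))

print(by_replace)  # ['Right\tCoronary Artery 2', 'Right Coronary Artery 1']
print(by_regex)    # ['Right Coronary Artery 1', 'Right\tCoronary Artery 2']
```

For the list in the question, which contains only plain spaces, both keys give the same order.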

Upvotes: 4

Talha Tayyab

Reputation: 27750

sortedList = sorted(listToBeSorted, key=lambda x: x.upper().replace(" ", ""))
print(sortedList)
#['Right Coronary Artery 1', 'Right  Coronary Artery 2', 'RIght Coronary Artery 3']

Upvotes: -1
