Becca codes
Becca codes

Reputation: 542

Insert item into case-insensitive sorted list in Python

I have a list of strings that is already sorted in case-insensitive order. I would like to insert a new string into the list. One way to do this is to append the item and then sort the list, like so:

myList.append('Something')
myList.sort(key=lambda s: s.lower())

But I was wondering if there is a way to just insert the item in the correct position without sorting the whole thing again.

I found this question: Insert an item into a sorted list in Python. It points towards Python's bisect module. But that module doesn't look like it can support case-insensitivity.


Edit: I tested out several of the answers listed here.

It was a close call to accept an answer. In the end, I went with Stefan Pochmann's answer because it was the best for a one-time insertion, and accessing the resulting list does not require accessing a member variable. However, use cases vary, so be sure to examine all the answers.

Upvotes: 6

Views: 2164

Answers (6)

Stefan Pochmann
Stefan Pochmann

Reputation: 28636

I see you added test results to you question. I did some benchmarks as well now and get a similar picture:

Insorting 20000 words:
 80.224 seconds with insort_sorting
  0.166 seconds with insort_own_binary_search
 70.294 seconds with insort_lower_all
  0.094 seconds with insort_keep_lower

You're somewhat wrong about the fast two, though. With larger numbers of insertions, mine becomes faster. About twice as fast:

Insorting 1000000 words:
 92.712 seconds with insort_own_binary_search
173.577 seconds with insort_keep_lower

That's because the O(log n) time for searching the index becomes negligible and the time is dominated by the O(n) time for the insert calls. And my solution has only one of those while the other solution has two.

Another difference is space complexity, keeping an extra list of lowered versions of all strings isn't as good.

Here's my benchmarking code:

import random, string, time

#--------------------------------------------------------------
def insort_sorting(a, x):
    a.append(x)
    a.sort(key=str.lower)
#--------------------------------------------------------------
def insort_own_binary_search(a, x):
    key = x.lower()
    lo, hi = 0, len(myList)
    while lo < hi:
        mid = (lo + hi) // 2
        if key < a[mid].lower():
            hi = mid
        else:
            lo = mid + 1
    a.insert(lo, x)
#--------------------------------------------------------------
from bisect import bisect
def insort_lower_all(a, x):
    index = bisect([i.lower() for i in a], x.lower())
    a.insert(index, x)
#--------------------------------------------------------------
from bisect import bisect
def insort_keep_lower(a, x, lower=[]):
    x_lower = x.lower()
    index = bisect(lower, x_lower)
    a.insert(index, x)
    lower.insert(index, x_lower)
#--------------------------------------------------------------

# Generate random words    
words = [''.join(random.choice(string.ascii_letters) for _ in range(10))
         for _ in range(20000)]
#         for _ in range(1000000)]

# Compare the solutions
print 'Insorting', len(words), 'words:'
reference = None
for insort in insort_sorting, insort_own_binary_search, insort_lower_all, insort_keep_lower:
#for insort in insort_own_binary_search, insort_keep_lower:
    t0 = time.time()
    myList = []
    for word in words:
        insort(myList, word)
    print '%7.3f seconds with %s' % (time.time() - t0, insort.__name__)
    if reference is None:
        reference = myList
    else:
        assert myList == reference

Upvotes: 1

Stefan Pochmann
Stefan Pochmann

Reputation: 28636

It's a good opportunity to practice binary search again (or just copy&paste&modify bisect.insort, which is what I did):

def insort_case_insensitive(a, x):
    key = x.lower()
    lo, hi = 0, len(myList)
    while lo < hi:
        mid = (lo + hi) // 2
        if key < a[mid].lower():
            hi = mid
        else:
            lo = mid + 1
    a.insert(lo, x)

Demo:

myList = ['a', 'b', 'c', 'd', 'e']
for x in 'A', 'B', 'C', 'D', 'E':
    insort_case_insensitive(myList, x)
print myList

Prints:

['a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E']

It's O(n), just like append+sort would be, but only because of the a.insert(lo, x) at the end. Which is dead-simple and done in C, so it's super fast. The binary search of course only takes O(log n) steps, so that's super fast as well. The append+sort way would call .lower() on all elements and compare them, both of which is much slower. @MoinuddinQuadri's first solution is also much slower because of calling .lower() on all elements.

See my other answer for a benchmarking comparison.

Upvotes: 5

Jared Goguen
Jared Goguen

Reputation: 9008

You could create your own type to encapsulate this behavior (combined with bisect as suggested in another answer).

from bisect import bisect

class CaseInsensitiveSortedList:
    def __init__(self, iterable):
        self.with_case = list(sorted(iterable, key=lambda s: s.lower()))
        self.without_case = [s.lower() for s in self.with_case]

    def insert_in_order(self, s):
        s_lower = s.lower()
        index = bisect(self.without_case, s_lower)
        self.without_case.insert(index, s_lower)
        self.with_case.insert(index, s)


test_list = CaseInsensitiveSortedList(['a', 'B', 'cd', 'E', 'fff'])

test_list.insert_in_order('D')
print(test_list.with_case) # ['a', 'B', 'cd', 'D', 'E', 'fff']

test_list.insert_in_order('ee')
print(test_list.with_case) # ['a', 'B', 'cd', 'D', 'E', 'ee', 'fff']

You could extend list directly and make this a little "more natural" or do whatever else you want with it. This is just an idea to avoid calling str.lower on every element for every insert.

Upvotes: 3

Moinuddin Quadri
Moinuddin Quadri

Reputation: 48100

Based on comment from OP, since it is OK to maintain the intermediate list to hold lower-cased strings (one time operation); it could be achieved as:

from bisect import bisect
my_list = ["aa", "bb", "Dd", "ee"]

lower_list = [i.lower() for i in my_list]  # list of lower-cased strings.
                                           # one time operation
insert_string = "CC"  # word to insert

# find index based on lower-cased list
index = bisect(lower_list, insert_string.lower())

my_list.insert(index, insert_string)  # insert word in original list
lower_list.insert(index, insert_string.lower())   # insert lower-cased word
                                                  # in lower-cased list

where final value of my_list will be lower_list will be:

>>> my_list   # original list
['aa', 'bb', 'CC', 'Dd', 'ee']

>>> lower_list
['aa', 'bb', 'cc', 'dd', 'ee']

Here we are bisecting the list of lower-cased words to find the index and based on the index, inserting the string into original list.

Upvotes: 1

gold_cy
gold_cy

Reputation: 14236

I have never worked with bisect but here is my stab at it. The first function I take directly from the bisect page that you linked:

def index(a, x):
    'Locate the leftmost value exactly equal to x'
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError

def insert_into_sorted(char, my_list):
    marker = chr(ord(char) + 1)
    spot = index(my_list, marker)
    my_list[spot:spot] = char
    return my_list

x = ['a', 'b', 'd', 'e']

insert_into_sorted('c', x)

>>['a', 'b', 'c', 'd', 'e']

Upvotes: 1

Moinuddin Quadri
Moinuddin Quadri

Reputation: 48100

You may use bisect.bisect on the lower-cased sorted list as:

from bisect import bisect
my_list = ["aa", "bb", "Dd", "ee"]
insert_string = "CC"

#                 convert all the items in list to lower case for
#               v finding the correct location via. bisect
index = bisect([i.lower() for i in my_list], insert_string.lower())
#                bisect based on lower-cased string for  ^
#                case-insensitive behavior

my_list.insert(index, insert_string)

where updated content of my_list will be:

['aa', 'bb', 'CC', 'Dd', 'ee']

Upvotes: 3

Related Questions