dingo_d
dingo_d

Reputation: 11670

python heapq sorting list wrong?

I am trying to sort lists into one list that contain numbers and names of sections, sub sections and sub sub sections. The program looks like this:

import heapq

sections = ['1. Section', '2. Section', '3. Section', '4. Section', '5. Section', '6. Section', '7. Section', '8. Section', '9. Section', '10. Section', '11. Section', '12. Section']
subsections = ['1.1 Subsection', '1.2 Subsection', '1.3 Subsection', '1.4 Subsection', '2.1 Subsection', '4.1 My subsection', '7.1 Subsection', '8.1 Subsection', '12.1 Subsection']
subsubsections = ['1.2.1 Subsubsection', '1.2.2 Subsubsection', '1.4.1 Subsubsection', '2.1.1 Subsubsection', '7.1.1 Subsubsection', '8.1.1 Subsubsection', '12.1.1 Subsubsection']

sorted_list = list(heapq.merge(sections, subsections, subsubsections))

print(sorted_list)

What I get out is this:

['1. Section', '1.1 Subsection', '1.2 Subsection', '1.2.1 Subsubsection', '1.2.2 Subsubsection', '1.3 Subsection', '1.4 Subsection', '1.4.1 Subsubsection', '2. Section', '2.1 Subsection', '2.1.1 Subsubsection', '3. Section', '4. Section', '4.1 My subsection', '5. Section', '6. Section', '7. Section', '7.1 Subsection', '7.1.1 Subsubsection', '8. Section', '8.1 Subsection', '12.1 Subsection', '8.1.1 Subsubsection', '12.1.1 Subsubsection', '9. Section', '10. Section', '11. Section', '12. Section']

My 12th subsection, and sub sub section is located within 8th section, not 12th.

Why is this happening? The original lists are sorted, and it all goes good, apparently up to number 10.

I'm not sure why this is happening and is there a way to better sort this into a 'tree' based on the numbers in the lists? I'm building a table of contents of sorts, and this will return (once I filter the list out)

1. Section
    1.1 Subsection
    1.2 Subsection
        1.2.1 Subsubsection
        1.2.2 Subsubsection
    1.3 Subsection
    1.4 Subsection
        1.4.1 Subsubsection
2. Section
    2.1 Subsection
        2.1.1 Subsubsection
3. Section
4. Section
    4.1 My subsection
5. Section
6. Section
7. Section
    7.1 Subsection
        7.1.1 Subsubsection
8. Section
    8.1 Subsection
    12.1 Subsection
        8.1.1 Subsubsection
        12.1.1 Subsubsection
9. Section
10. Section
11. Section
12. Section

Notice the 12.1 Subsection behind 8.1 Subsection and 12.1.1 Subsubsection behind 8.1.1 Subsubsection.

Upvotes: 1

Views: 1210

Answers (2)

Kasravnd
Kasravnd

Reputation: 107287

As explained in other answer you have to specify a sorting method, otherwise python will sort the strings lexicographically. If you are using python 3.5+ you can use key argument in merge function, in python 3.5- you can use itertools.chain and sorted, and as a general approach you can use regex in order to find the numbers and convert them to int :

In [18]: from itertools import chain
In [19]: import re
In [23]: sorted(chain.from_iterable((sections, subsections, subsubsections)),
                key = lambda x: [int(i) for i in re.findall(r'\d+', x)])
Out[23]: 
['1. Section',
 '1.1 Subsection',
 '1.2 Subsection',
 '1.2.1 Subsubsection',
 '1.2.2 Subsubsection',
 '1.3 Subsection',
 '1.4 Subsection',
 '1.4.1 Subsubsection',
 '2. Section',
 '2.1 Subsection',
 '2.1.1 Subsubsection',
 '3. Section',
 '4. Section',
 '4.1 My subsection',
 '5. Section',
 '6. Section',
 '7. Section',
 '7.1 Subsection',
 '7.1.1 Subsubsection',
 '8. Section',
 '8.1 Subsection',
 '8.1.1 Subsubsection',
 '9. Section',
 '10. Section',
 '11. Section',
 '12. Section',
 '12.1 Subsection',
 '12.1.1 Subsubsection']

Upvotes: 3

Martijn Pieters
Martijn Pieters

Reputation: 1121784

Your lists may appear sorted, to a human eye. But to Python, your inputs are not fully sorted, because it sorts strings lexicographically. That means that '12' comes before '8' in sorted order, because only the first characters are compared.

As such, the merge is completely correct; the string starting '12.1' is encountered after the '8.1' string was seen, but the one starting with '8.1.1' is sorted afterwards.

You'll have to extract tuples of integers from the strings with a key function to sort correctly:

section = lambda s: [int(d) for d in s.partition(' ')[0].split('.') if d]
heapq.merge(sections, subsections, subsubsections, key=section))

Note that the key argument is only available in Python 3.5 and up; you'd have to do a manual decorate-merge-undecorate dance in earlier versions.

Demo (using Python 3.6):

>>> section = lambda s: [int(d) for d in s.partition(' ')[0].split('.') if d]
>>> sorted_list = list(heapq.merge(sections, subsections, subsubsections, key=section))
>>> from pprint import pprint
>>> pprint(sorted_list)
['1. Section',
 '1.1 Subsection',
 '1.2 Subsection',
 '1.2.1 Subsubsection',
 '1.2.2 Subsubsection',
 '1.3 Subsection',
 '1.4 Subsection',
 '1.4.1 Subsubsection',
 '2. Section',
 '2.1 Subsection',
 '2.1.1 Subsubsection',
 '3. Section',
 '4. Section',
 '4.1 My subsection',
 '5. Section',
 '6. Section',
 '7. Section',
 '7.1 Subsection',
 '7.1.1 Subsubsection',
 '8. Section',
 '8.1 Subsection',
 '8.1.1 Subsubsection',
 '9. Section',
 '10. Section',
 '11. Section',
 '12. Section',
 '12.1 Subsection',
 '12.1.1 Subsubsection']

The keyed merge is easily backported to Python 3.3 and 3.4:

import heapq

def _heappop_max(heap):
    lastelt = heap.pop()
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        heapq._siftup_max(heap, 0)
        return returnitem
    return lastelt

def _heapreplace_max(heap, item):
    returnitem = heap[0]
    heap[0] = item
    heapq._siftup_max(heap, 0)
    return returnitem

def merge(*iterables, key=None, reverse=False):    
    h = []
    h_append = h.append

    if reverse:
        _heapify = heapq._heapify_max
        _heappop = _heappop_max
        _heapreplace = _heapreplace_max
        direction = -1
    else:
        _heapify = heapify
        _heappop = heappop
        _heapreplace = heapreplace
        direction = 1

    if key is None:
        for order, it in enumerate(map(iter, iterables)):
            try:
                next = it.__next__
                h_append([next(), order * direction, next])
            except StopIteration:
                pass
        _heapify(h)
        while len(h) > 1:
            try:
                while True:
                    value, order, next = s = h[0]
                    yield value
                    s[0] = next()           # raises StopIteration when exhausted
                    _heapreplace(h, s)      # restore heap condition
            except StopIteration:
                _heappop(h)                 # remove empty iterator
        if h:
            # fast case when only a single iterator remains
            value, order, next = h[0]
            yield value
            yield from next.__self__
        return

    for order, it in enumerate(map(iter, iterables)):
        try:
            next = it.__next__
            value = next()
            h_append([key(value), order * direction, value, next])
        except StopIteration:
            pass
    _heapify(h)
    while len(h) > 1:
        try:
            while True:
                key_value, order, value, next = s = h[0]
                yield value
                value = next()
                s[0] = key(value)
                s[2] = value
                _heapreplace(h, s)
        except StopIteration:
            _heappop(h)
    if h:
        key_value, order, value, next = h[0]
        yield value
        yield from next.__self__

A decorate-sort-undecorate merge is as simple as:

def decorate(iterable, key):
    for elem in iterable:
        yield key(elem), elem

sorted = [v for k, v in heapq.merge(
    decorate(sections, section), decorate(subsections, section)
    decorate(subsubsections, section))]

Because your input is already sorted, using a merge sort is more efficient. As a last resort, you could just use sorted() however:

from itertools import chain
result = sorted(chain(sections, subsections, subsubsections), key=section)

Upvotes: 4

Related Questions