Reputation: 11670
I am trying to sort lists into one list that contain numbers and names of sections, sub sections and sub sub sections. The program looks like this:
import heapq
sections = ['1. Section', '2. Section', '3. Section', '4. Section', '5. Section', '6. Section', '7. Section', '8. Section', '9. Section', '10. Section', '11. Section', '12. Section']
subsections = ['1.1 Subsection', '1.2 Subsection', '1.3 Subsection', '1.4 Subsection', '2.1 Subsection', '4.1 My subsection', '7.1 Subsection', '8.1 Subsection', '12.1 Subsection']
subsubsections = ['1.2.1 Subsubsection', '1.2.2 Subsubsection', '1.4.1 Subsubsection', '2.1.1 Subsubsection', '7.1.1 Subsubsection', '8.1.1 Subsubsection', '12.1.1 Subsubsection']
sorted_list = list(heapq.merge(sections, subsections, subsubsections))
print(sorted_list)
What I get out is this:
['1. Section', '1.1 Subsection', '1.2 Subsection', '1.2.1 Subsubsection', '1.2.2 Subsubsection', '1.3 Subsection', '1.4 Subsection', '1.4.1 Subsubsection', '2. Section', '2.1 Subsection', '2.1.1 Subsubsection', '3. Section', '4. Section', '4.1 My subsection', '5. Section', '6. Section', '7. Section', '7.1 Subsection', '7.1.1 Subsubsection', '8. Section', '8.1 Subsection', '12.1 Subsection', '8.1.1 Subsubsection', '12.1.1 Subsubsection', '9. Section', '10. Section', '11. Section', '12. Section']
My 12th subsection, and sub sub section is located within 8th section, not 12th.
Why is this happening? The original lists are sorted, and it all goes good, apparently up to number 10.
I'm not sure why this is happening and is there a way to better sort this into a 'tree' based on the numbers in the lists? I'm building a table of contents of sorts, and this will return (once I filter the list out)
1. Section
1.1 Subsection
1.2 Subsection
1.2.1 Subsubsection
1.2.2 Subsubsection
1.3 Subsection
1.4 Subsection
1.4.1 Subsubsection
2. Section
2.1 Subsection
2.1.1 Subsubsection
3. Section
4. Section
4.1 My subsection
5. Section
6. Section
7. Section
7.1 Subsection
7.1.1 Subsubsection
8. Section
8.1 Subsection
12.1 Subsection
8.1.1 Subsubsection
12.1.1 Subsubsection
9. Section
10. Section
11. Section
12. Section
Notice the 12.1 Subsection behind 8.1 Subsection and 12.1.1 Subsubsection behind 8.1.1 Subsubsection.
Upvotes: 1
Views: 1210
Reputation: 107287
As explained in other answer you have to specify a sorting method, otherwise python will sort the strings lexicographically. If you are using python 3.5+ you can use key
argument in merge
function, in python 3.5- you can use itertools.chain
and sorted
, and as a general approach you can use regex in order to find the numbers and convert them to int :
In [18]: from itertools import chain
In [19]: import re
In [23]: sorted(chain.from_iterable((sections, subsections, subsubsections)),
key = lambda x: [int(i) for i in re.findall(r'\d+', x)])
Out[23]:
['1. Section',
'1.1 Subsection',
'1.2 Subsection',
'1.2.1 Subsubsection',
'1.2.2 Subsubsection',
'1.3 Subsection',
'1.4 Subsection',
'1.4.1 Subsubsection',
'2. Section',
'2.1 Subsection',
'2.1.1 Subsubsection',
'3. Section',
'4. Section',
'4.1 My subsection',
'5. Section',
'6. Section',
'7. Section',
'7.1 Subsection',
'7.1.1 Subsubsection',
'8. Section',
'8.1 Subsection',
'8.1.1 Subsubsection',
'9. Section',
'10. Section',
'11. Section',
'12. Section',
'12.1 Subsection',
'12.1.1 Subsubsection']
Upvotes: 3
Reputation: 1121784
Your lists may appear sorted, to a human eye. But to Python, your inputs are not fully sorted, because it sorts strings lexicographically. That means that '12'
comes before '8'
in sorted order, because only the first characters are compared.
As such, the merge is completely correct; the string starting '12.1'
is encountered after the '8.1'
string was seen, but the one starting with '8.1.1'
is sorted afterwards.
You'll have to extract tuples of integers from the strings with a key function to sort correctly:
section = lambda s: [int(d) for d in s.partition(' ')[0].split('.') if d]
heapq.merge(sections, subsections, subsubsections, key=section))
Note that the key
argument is only available in Python 3.5 and up; you'd have to do a manual decorate-merge-undecorate dance in earlier versions.
Demo (using Python 3.6):
>>> section = lambda s: [int(d) for d in s.partition(' ')[0].split('.') if d]
>>> sorted_list = list(heapq.merge(sections, subsections, subsubsections, key=section))
>>> from pprint import pprint
>>> pprint(sorted_list)
['1. Section',
'1.1 Subsection',
'1.2 Subsection',
'1.2.1 Subsubsection',
'1.2.2 Subsubsection',
'1.3 Subsection',
'1.4 Subsection',
'1.4.1 Subsubsection',
'2. Section',
'2.1 Subsection',
'2.1.1 Subsubsection',
'3. Section',
'4. Section',
'4.1 My subsection',
'5. Section',
'6. Section',
'7. Section',
'7.1 Subsection',
'7.1.1 Subsubsection',
'8. Section',
'8.1 Subsection',
'8.1.1 Subsubsection',
'9. Section',
'10. Section',
'11. Section',
'12. Section',
'12.1 Subsection',
'12.1.1 Subsubsection']
The keyed merge is easily backported to Python 3.3 and 3.4:
import heapq
def _heappop_max(heap):
lastelt = heap.pop()
if heap:
returnitem = heap[0]
heap[0] = lastelt
heapq._siftup_max(heap, 0)
return returnitem
return lastelt
def _heapreplace_max(heap, item):
returnitem = heap[0]
heap[0] = item
heapq._siftup_max(heap, 0)
return returnitem
def merge(*iterables, key=None, reverse=False):
h = []
h_append = h.append
if reverse:
_heapify = heapq._heapify_max
_heappop = _heappop_max
_heapreplace = _heapreplace_max
direction = -1
else:
_heapify = heapify
_heappop = heappop
_heapreplace = heapreplace
direction = 1
if key is None:
for order, it in enumerate(map(iter, iterables)):
try:
next = it.__next__
h_append([next(), order * direction, next])
except StopIteration:
pass
_heapify(h)
while len(h) > 1:
try:
while True:
value, order, next = s = h[0]
yield value
s[0] = next() # raises StopIteration when exhausted
_heapreplace(h, s) # restore heap condition
except StopIteration:
_heappop(h) # remove empty iterator
if h:
# fast case when only a single iterator remains
value, order, next = h[0]
yield value
yield from next.__self__
return
for order, it in enumerate(map(iter, iterables)):
try:
next = it.__next__
value = next()
h_append([key(value), order * direction, value, next])
except StopIteration:
pass
_heapify(h)
while len(h) > 1:
try:
while True:
key_value, order, value, next = s = h[0]
yield value
value = next()
s[0] = key(value)
s[2] = value
_heapreplace(h, s)
except StopIteration:
_heappop(h)
if h:
key_value, order, value, next = h[0]
yield value
yield from next.__self__
A decorate-sort-undecorate merge is as simple as:
def decorate(iterable, key):
for elem in iterable:
yield key(elem), elem
sorted = [v for k, v in heapq.merge(
decorate(sections, section), decorate(subsections, section)
decorate(subsubsections, section))]
Because your input is already sorted, using a merge sort is more efficient. As a last resort, you could just use sorted()
however:
from itertools import chain
result = sorted(chain(sections, subsections, subsubsections), key=section)
Upvotes: 4