Reputation: 3797
I would like to understand Timsort in full detail, so an answer like "just call .sort()
in Java or in Python" is not what I'm looking for.
I've already read the Wikipedia article, the Java implementation, the CPython implementation, the listsort.txt that these sources mention, and the publication that listsort.txt references: Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs. I've also skimmed through some of the references of Nearly-Optimal Mergesorts, namely Scaling and Related Techniques for Geometry Problems, A Unifying Look at Data Structures, and Fast Algorithms for Finding Nearest Common Ancestors; I wouldn't say I understood the latter three in full depth, but I'm quite sure they don't answer my question.
I grok most of the sub-algorithms: run counting and run reversal, binary insertion sort, merging (saving the head/tail part), and galloping. Please do not try to explain these; I know what they do, how they do it, and why.
What I'm missing is the merge pattern. I do understand how it works mechanically: it maintains a stack of runs and decides when to merge by repeatedly checking invariants. I also see the goals: adapting to natural runs, achieving sorting stability, balancing merges, exploiting temporal locality, and keeping the algorithm simple. But I don't get why reestablishing these invariants results in an efficient algorithm.
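To fix terminology: by "merge pattern" I mean the policy implemented by java.util.TimSort's mergeCollapse() (CPython's merge_collapse()). Sketched after the classic, pre-powersort Java version, where mergeAt(n) merges stack entries n and n + 1 in place:

// Restore the invariants on the top of the run stack after each push:
//   runLen[n - 1] > runLen[n] + runLen[n + 1]  and  runLen[n] > runLen[n + 1].
private void mergeCollapse() {
    while (stackSize > 1) {
        int n = stackSize - 2;                  // second run from the top
        if (n > 0 && runLen[n - 1] <= runLen[n] + runLen[n + 1]) {
            if (runLen[n - 1] < runLen[n + 1])  // merge toward the smaller neighbor
                n--;
            mergeAt(n);
        } else if (runLen[n] <= runLen[n + 1]) {
            mergeAt(n);
        } else {
            break;                              // both invariants hold again
        }
    }
}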
The file listsort.txt states (line 346):
The code now uses the "powersort" merge strategy from: "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs" J. Ian Munro and Sebastian Wild
I understand how Munro and Wild's powersort works, and I find their explanation involving "nearly-optimal alphabetic trees" with the "bisection heuristic" and "movement along the edges of Cartesian trees" sufficient.
What I can't grasp is the link between powersort's and Timsort's merge pattern.
Clearly, they are different.
I see that a higher node power usually lies between shorter runs, but I cannot deduce one algorithm from the other. Please explain to me what the connection between the two is.
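For reference, here is the node-power computation, transliterated to Java from CPython's powerloop() in listobject.c (variable names are mine):

// Node power of the boundary between run A = a[s1 .. s1+n1) and the run B of
// length n2 that follows it, in an array of total length n. Compare the binary
// expansions of the runs' normalized midpoints one bit at a time; the power is
// the position of the first bit where they differ. (We track 2*a and 2*b to
// stay in integers; that only shifts the expansions by one bit.)
static int nodePower(long s1, long n1, long n2, long n) {
    long a = 2 * s1 + n1;        // 2 * midpoint of A
    long b = a + n1 + n2;        // 2 * midpoint of B
    int power = 0;
    for (;;) {
        ++power;
        if (a >= n) {            // both next bits are 1
            a -= n;
            b -= n;
        } else if (b >= n) {     // A's bit is 0, B's bit is 1: first difference
            break;
        }                        // else both next bits are 0
        a <<= 1;
        b <<= 1;
    }
    return power;
}

(Two short adjacent runs have midpoints that lie close together, so the expansions share a longer common prefix and the boundary gets a higher power, i.e., a deeper node in the merge tree.)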
Once I understand this, I hope I will also find the justification for the stack capacities:
/*
* Allocate runs-to-be-merged stack (which cannot be expanded). The
* stack length requirements are described in listsort.txt. The C
* version always uses the same stack length (85), but this was
* measured to be too expensive when sorting "mid-sized" arrays (e.g.,
* 100 elements) in Java. Therefore, we use smaller (but sufficiently
* large) stack lengths for smaller arrays. The "magic numbers" in the
* computation below must be changed if MIN_MERGE is decreased. See
* the MIN_MERGE declaration above for more information.
*/
int stackLen = (len <    120  ?  5 :
                len <   1542  ? 10 :
                len < 119151  ? 19 : 40);
How were these numbers calculated? Why is it certain that the stack won't grow beyond these limits? Why doesn't the code check whether there is a free slot? If there isn't, then (a) expanding the stack or (b) merging at the top would easily prevent an ArrayIndexOutOfBoundsException; but I didn't find anything like that in the code.
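For what it's worth, my best attempt at reconstructing the derivation — a sketch assuming the invariant runLen[i - 2] > runLen[i - 1] + runLen[i] holds for every entry on the stack and that run lengths are bounded below by MIN_MERGE / 2 = 16:

// Hypothetical reconstruction of the magic numbers. Under the assumption
// above, the shortest runs a stack of depth d can hold are
//   b(1) = 16, b(2) = 17, b(i) = b(i-1) + b(i-2) + 1,
// and an array of length < 1 + b(1) + ... + b(d) never needs depth > d.
public class StackBounds {
    public static void main(String[] args) {
        long shorter = 16, longer = 17;       // two shortest admissible runs
        long total = shorter + longer;
        for (int depth = 2; depth <= 19; depth++) {
            System.out.printf("depth %2d suffices for len < %d%n",
                              depth, total + 1);
            long next = shorter + longer + 1; // strictly exceeds the sum
            shorter = longer;
            longer = next;
            total += longer;
        }
        // Prints, among others: depth  4 suffices for len < 120,
        // depth  9 for len < 1542, depth 18 for len < 119151 -- exactly
        // the thresholds above (stackLen = depth + 1, apparently leaving
        // one slot for the run pushed before the collapse).
    }
}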
Upvotes: 5
Views: 363
Reputation: 65498
Please explain to me what the connection between the two is.
They instantiate a common framework: split the input list into runs, then merge adjacent runs until only one remains. The different ways of doing this can be summarized by binary trees whose leaves are the runs. Timsort, powersort, and peeksort (from the same paper as powersort) represent three different ways of constructing that tree.
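To make the framework concrete, here is a toy, self-contained rendering (all names are mine; run detection and the merge policy are deliberately simplified, so this illustrates the skeleton, not any real implementation):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class RunMergeFramework {

    // A pending run is a pair {start, length}; the policy restores its
    // invariants on the stack after each push. The algorithms differ only
    // in this policy, i.e., in the shape of the merge tree it builds.
    interface MergePolicy {
        void collapse(Deque<long[]> pending, int[] a);
    }

    static int nextAscendingRunLength(int[] a, int start) {
        int i = start + 1;
        while (i < a.length && a[i - 1] <= a[i]) i++;
        return i - start;
    }

    // Stable two-way merge of a[start .. start+len1) with the run after it.
    static void mergeAdjacent(int[] a, int start, int len1, int len2) {
        int[] merged = new int[len1 + len2];
        int i = start, j = start + len1, k = 0;
        int mid = start + len1, end = start + len1 + len2;
        while (i < mid && j < end)
            merged[k++] = (a[i] <= a[j]) ? a[i++] : a[j++]; // left wins ties
        while (i < mid) merged[k++] = a[i++];
        while (j < end) merged[k++] = a[j++];
        System.arraycopy(merged, 0, a, start, merged.length);
    }

    static void sort(int[] a, MergePolicy policy) {
        Deque<long[]> pending = new ArrayDeque<>();
        for (int i = 0; i < a.length; ) {
            int len = nextAscendingRunLength(a, i);
            pending.push(new long[]{i, len});
            policy.collapse(pending, a);   // the ONLY algorithm-specific step
            i += len;
        }
        while (pending.size() > 1) {       // final collapse, right to left
            long[] top = pending.pop(), below = pending.pop();
            mergeAdjacent(a, (int) below[0], (int) below[1], (int) top[1]);
            pending.push(new long[]{below[0], below[1] + top[1]});
        }
    }

    public static void main(String[] args) {
        int[] a = {3, 4, 9, 1, 2, 8, 5, 6, 7, 0};
        // A Timsort-flavored toy policy enforcing just runLen[n] > runLen[n+1].
        sort(a, (pending, arr) -> {
            while (pending.size() > 1) {
                long[] top = pending.pop(), below = pending.pop();
                if (below[1] > top[1]) {   // invariant holds; stop merging
                    pending.push(below);
                    pending.push(top);
                    break;
                }
                mergeAdjacent(arr, (int) below[0], (int) below[1], (int) top[1]);
                pending.push(new long[]{below[0], below[1] + top[1]});
            }
        });
        System.out.println(Arrays.toString(a)); // [0, 1, 2, ..., 9]
    }
}

Plug in a different collapse policy — Timsort's two invariants, powersort's node powers, or peeksort's recursive bisection — and only the shape of the merge tree changes, not the skeleton.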
How were these numbers calculated?
Wrongly. (In case the link dies, that's de Gouw, Rot, de Boer, Bubel, and Hähnle: "OpenJDK's java.utils.Collection.sort() is broken: The good, the bad and the worst case".) In short: the bounds were derived as if the invariants held for every run on the stack, but mergeCollapse only reestablishes them near the top, so deeper entries can end up violating them and the stack can grow beyond the precomputed limit. OpenJDK's response was to enlarge stackLen, while CPython instead fixed merge_collapse to recheck the invariant one level deeper.
Upvotes: 5