Reputation: 557
Could somebody direct me to an algorithm that I can use to sort X numbers in batches of Y? Meaning that you can only compare Y numbers at a time, but you can do that multiple times.
E.g. There are X=100 statements and a respondent must sort them according to how relevant they are to her in such a way that she will only see and sort Y=9 statements at a time, but will do that multiple times.
Upvotes: 3
Views: 360
Reputation: 46445
From your hypothetical, I believe you are willing to do a lot of work to figure out the next comparison set (because that is done by a computer), and would like as few comparisons as possible (because those are done by a human).
So the approach I will outline is a greedy heuristic that attempts to maximize how much information each comparison set gives us. It is complicated, but should do very well.
The first thing we need is a way to measure information. Here is the mathematical theory. Suppose that we have a biased coin with probability `p` of coming up heads. The information in it coming up heads is `-log2(p)`. The information in it coming up tails is `-log2(1-p)`. (Note that the log of a number between 0 and 1 is negative, and the negative of a negative is positive. So information is always positive.) If you use an efficient encoding and have many flips to encode, the sum of the information of a sequence of flips is how many bits you need to send to communicate it.

The expected information of a single flip is therefore `-p log2(p) - (1-p) log2(1-p)`.
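For concreteness, here is that formula as a tiny Python helper (this is just the expected-information formula above; the later sketches reuse it):

```python
import math

def flip_entropy(p):
    """Expected information, in bits, of one flip of a coin with P(heads) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```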
So the idea is to pick a comparison set such that sorting it gives us as much information as possible about the final sort that we don't already have. But how do we estimate how much is not known about a particular pair? For example, if I have already sorted 2 groups of 5, the top of one group is unlikely to be less than the bottom of the other. It could be, but there is much less information in that comparison than in comparing the two middle elements with each other. How do we capture that?
My idea for how to do that is to run a series of topological sorts of the partial order we know so far, and see how much they vary. The first topological sort you do randomly. The second topological sort you try to make as different as possible by, at every choice, choosing the element which had the largest rank the last time. For the third topological sort you choose the element whose sum of ranks in the previous sorts was as large as possible. And so on. Do this 20x or so.
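Here is a rough sketch of that procedure in Python. The representation of the already-answered comparisons as (before, after) pairs, and all the names, are my own choices rather than anything prescribed by the heuristic:

```python
import random
from collections import defaultdict

def diverse_topological_sorts(items, known_before, rounds=20):
    """Produce `rounds` topological orders of `items`, all consistent with the
    comparisons answered so far, trying to make each new order as different
    as possible from the earlier ones.

    `known_before` is a collection of (a, b) pairs meaning "a is ranked before b".
    """
    succ = defaultdict(set)
    for a, b in known_before:
        succ[a].add(b)

    orders = []
    rank_sum = defaultdict(int)  # total rank each item has received so far

    for r in range(rounds):
        # Kahn's algorithm, with the choice among ready items steered
        # toward items that ranked late in earlier passes.
        indeg = {x: 0 for x in items}
        for a in succ:
            for b in succ[a]:
                indeg[b] += 1
        ready = [x for x in items if indeg[x] == 0]
        order = []
        while ready:
            if r == 0:
                choice = random.choice(ready)  # first pass: fully random
            else:
                choice = max(ready, key=lambda x: (rank_sum[x], random.random()))
            ready.remove(choice)
            order.append(choice)
            for b in succ[choice]:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
        for rank, x in enumerate(order):
            rank_sum[x] += rank
        orders.append(order)
    return orders
```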
Now for any pair of elements we can just look at how often our sorts disagree about their relative order to estimate a probability that one really ranks above the other. We can turn that into an expected entropy with the formula from before.
So we start the comparison set with the element with the largest difference between its maximum and minimum rank in the sorts.
The second element is the one that has the highest entropy with the first, breaking ties by the largest difference between its minimum and maximum rank in the sorts.
The third is the one whose sum of entropies with the first two is largest, again breaking ties in the same way, and so on until the comparison set has Y elements.
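Putting those steps together, a greedy selection could look roughly like this (building on `flip_entropy` and the orders produced above; again the names and signature are illustrative):

```python
def choose_comparison_set(items, orders, k):
    """Greedily pick the k items to show in the next comparison set."""
    # rank of each item in each sample order
    ranks = {x: [] for x in items}
    for order in orders:
        for rank, x in enumerate(order):
            ranks[x].append(rank)
    # spread: difference between an item's maximum and minimum rank
    spread = {x: max(ranks[x]) - min(ranks[x]) for x in items}

    def pair_entropy(a, b):
        # estimate P(a before b) by how often the sample orders put a first
        before = sum(1 for ra, rb in zip(ranks[a], ranks[b]) if ra < rb)
        return flip_entropy(before / len(orders))

    chosen = [max(items, key=lambda x: spread[x])]
    while len(chosen) < min(k, len(items)):
        remaining = [x for x in items if x not in chosen]
        # most total entropy against the items already chosen,
        # ties broken by the largest rank spread
        best = max(remaining,
                   key=lambda x: (sum(pair_entropy(x, c) for c in chosen),
                                  spread[x]))
        chosen.append(best)
    return chosen
```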
The exact logic that the algorithm follows is, of course, randomized. In fact you're doing `O(k^2 n)` work per comparison set that you find. But it will on average finish with surprisingly few comparison sets.
I don't have a proof, but I suspect that you will on average only need the theoretically optimal `O(log(n!) / log(k!)) = O(n log(n) / (k log(k)))` comparisons. For `k = 2` my further suspicion is that it will give a solution that is on average more efficient than merge sort.
Upvotes: 2
Reputation: 28312
At each round, you'll sort `floor(X/Y)` batches of `Y` elements and one batch of `X mod Y` elements.

Suppose for simplicity that the input is given as an array `A[1...X]`.

At the first round, the batches will be `A[1...Y], A[Y+1...2Y], ..., A[(floor(X/Y)-1)Y+1...floor(X/Y)Y], A[floor(X/Y)Y+1...X]`.
For the second round, shift these ranges right by `Y/2` places (you can use wrap-around if you like, though for simplicity I will simply assume the first `Y/2` elements are left alone in even-numbered iterations). So the ranges could be `A[Y/2+1...3Y/2]`, `A[3Y/2+1...5Y/2]`, etc. The next round repeats the ranges of the first, the round after that repeats the ranges of the second, and so on.

How many iterations are needed in the worst case to guarantee a fully sorted list? Well, in the worst case the maximum element must migrate from the beginning to the end, and since it takes two iterations for an element to migrate across one full odd-iteration section (see the example below), it stands to reason that it takes `2*ceiling(X/Y)` iterations in total for an element at the front to reach the end.
Example:
X=11
Y=3
A = [7, 2, 4, 5, 2, 1, 6, 2, 3, 5, 6]
[7,2,4] [5,2,1] [6,2,3] [5,6] => [2,4,7] [1,2,5] [2,3,6] [5,6]
2 [4,7,1] [2,5,2] [3,6,5] [6] => 2 [1,4,7] [2,2,5] [3,5,6] [6]
[2,1,4] [7,2,2] [5,3,5] [6,6] => [1,2,4] [2,2,7] [3,5,5] [6,6]
1 [2,4,2] [2,7,3] [5,5,6] [6] => 1 [2,2,4] [2,3,7] [5,5,6] [6]
[1,2,2] [4,2,3] [7,5,5] [6,6] => [1,2,2] [2,3,4] [5,5,7] [6,6]
1 [2,2,2] [3,4,5] [5,7,6] [6] => 1 [2,2,2] [3,4,5] [5,6,7] [6]
[1,2,2] [2,3,4] [5,5,6] [7,6] => [1,2,2] [2,3,4] [5,5,6] [6,7]
1 [2,2,2] [3,4,5] [5,6,6] [7] => no change, termination condition
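Here is a minimal Python sketch of this procedure, with `sorted` standing in for however you actually order Y elements at a time. It terminates once a full pair of rounds (one at each offset) makes no change, which is a slightly more conservative check than the single-round one used in the trace:

```python
def batch_sort(a, y):
    """Sort list `a` in place using only sorts of contiguous batches of at
    most `y` elements (y >= 2), alternating the batch boundaries between rounds.
    """
    n = len(a)
    shift = y // 2                       # offset used on every second round
    unchanged_rounds = 0
    round_no = 0
    while unchanged_rounds < 2:          # no change at either offset => sorted
        start = 0 if round_no % 2 == 0 else shift  # alternate the boundaries
        changed = False
        # on offset rounds the first `shift` elements are left alone
        for lo in range(start, n, y):
            batch = a[lo:lo + y]
            ordered = sorted(batch)
            if ordered != batch:
                a[lo:lo + y] = ordered
                changed = True
        unchanged_rounds = 0 if changed else unchanged_rounds + 1
        round_no += 1
    return a

print(batch_sort([7, 2, 4, 5, 2, 1, 6, 2, 3, 5, 6], 3))
# [1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 7]
```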
This might seem a little silly, but if you have an efficient way to sort small groups and a lot of parallelism available this could be pretty nifty.
Upvotes: 1