Reputation: 76077
I have to merge several already-sorted collections into a single sorted collection. The collections are small (around 5 elements each) and there aren't many sources (maybe about 10), but the output must be computed fast (it's for a non-blocking server).
In the case of two source collections it's pretty trivial, but as the number of source collections increases there are different strategies to consider (n source collections, each with m elements):

1. Flatten everything and sort: O(n*m*log(n*m)) complexity (quicksorting n*m elements).
2. Repeatedly scan the heads and pick the minimum: O(n*n*m) complexity (scan n heads n*m times, i.e. once per output element).
3. Left-fold merging, Merge(c3, Merge(c2, Merge(c1, c0))): I guess this has O(n*n*m) complexity too (n-1 merging phases, each phase over at most n*m elements).
4. Divide-and-conquer merging, Merge(Merge(c3, c2), Merge(c1, c0)): there are log2(n) levels of merging, and at each level we halve the number of collections but double their sizes, so it should be O(log(n)*n*m).
5. Heap-based selection of the minimum head: O(log(n)*n*m) (the heap remove and insert operations cost log(n) each and are done n*m times).

So, the computational complexity (if I haven't made any mistakes) of the sophisticated heap-based selection of the min element every time seems to be the best. Memory-wise the heap might also be more garbage-collector friendly, as it doesn't need as many intermediate temporary collections.
Have I missed something (mistakes in calculating the complexities, or an alternative approach)? My implementation will most likely be in JavaScript, or maybe Scala, so any runtime concerns that apply to those execution environments are welcome!
BTW this is not related: Most efficient way to merge collections preserving order?
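Since the implementation may end up in JavaScript, here is a minimal sketch of the heap-based strategy, with an inline binary min-heap keyed on the current head of each source. The `mergeSorted` name and the heap helpers are illustrative, not from any library:

```javascript
// Heap-based k-way merge: keep one [list, index] entry per non-exhausted
// source in a binary min-heap ordered by the element at that index;
// repeatedly pop the smallest head and push back the advanced entry.
function mergeSorted(lists) {
  const heap = [];
  const less = (a, b) => a[0][a[1]] < b[0][b[1]]; // compare current heads

  const push = (entry) => {
    heap.push(entry);
    let i = heap.length - 1;
    while (i > 0) {                       // sift up
      const p = (i - 1) >> 1;
      if (!less(heap[i], heap[p])) break;
      [heap[i], heap[p]] = [heap[p], heap[i]];
      i = p;
    }
  };

  const pop = () => {
    const top = heap[0];
    const last = heap.pop();
    if (heap.length > 0) {
      heap[0] = last;
      let i = 0;
      for (;;) {                          // sift down
        const l = 2 * i + 1, r = 2 * i + 2;
        let m = i;
        if (l < heap.length && less(heap[l], heap[m])) m = l;
        if (r < heap.length && less(heap[r], heap[m])) m = r;
        if (m === i) break;
        [heap[i], heap[m]] = [heap[m], heap[i]];
        i = m;
      }
    }
    return top;
  };

  for (const list of lists) if (list.length > 0) push([list, 0]);
  const out = [];
  while (heap.length > 0) {
    const [list, idx] = pop();
    out.push(list[idx]);
    if (idx + 1 < list.length) push([list, idx + 1]);
  }
  return out;
}
```

Entries are `[array, index]` pairs so no intermediate slices are allocated; only the heap and the output array are created.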
Upvotes: 1
Views: 1023
Reputation: 167901
tl;dr
Merge.
Full version
There's nothing like testing.
Let's suppose your underlying collection is a List. We'll store them in an Array.
val test = Array(
  List("alpha", "beta", "gamma", "delta", "epsilon"),
  List("one", "two", "three", "four", "five", "six"),
  List("baby", "child", "adult", "senior"),
  List("red", "yellow", "green"),
  List("red", "orange", "yellow", "green", "blue", "indigo", "violet"),
  List("tabby", "siamese", "manx", "persian"),
  List("collie", "boxer", "bulldog", "rottweiler", "poodle", "terrier"),
  List("budgie", "cockatiel", "macaw", "galah", "cockatoo"),
  List("Alabama", "California", "Georgia", "Maine", "Texas", "Vermont", "Wyoming"),
  List("I", "have", "to", "merge", "into")
).map(_.sorted)
Then your initial idea might be to just flatten and sort.
scala> val ans = th.pbench{ test.flatten.sorted.toList }
Benchmark (20460 calls in 183.2 ms)
Time: 8.246 us 95% CI 8.141 us - 8.351 us (n=18)
Garbage: 390.6 ns (n=1 sweeps measured)
ans: List[String] = List(Alabama, California, Georgia, I, Maine, ...)
Or you might implement a custom flatten-and-sort:
def flat(ss: Array[List[String]], i0: Int, i1: Int): Array[String] = {
  var n = 0
  var i = i0
  while (i < i1) {
    n += ss(i).length
    i += 1
  }
  val a = new Array[String](n)
  var j = 0
  i = i0
  while (i < i1) {
    var s = ss(i)
    while (s ne Nil) {
      a(j) = s.head
      j += 1
      s = s.tail
    }
    i += 1
  }
  a
}
def mrg(ss: Array[List[String]]): List[String] = {
  val a = flat(ss, 0, ss.length)
  java.util.Arrays.sort(a, new java.util.Comparator[String]{
    def compare(x: String, y: String) = x.compare(y)
  })
  a.toList
}
scala> val ans = th.pbench{ mrg(test) }
Benchmark (20460 calls in 151.7 ms)
Time: 6.883 us 95% CI 6.850 us - 6.915 us (n=18)
Garbage: 293.0 ns (n=1 sweeps measured)
ans: List[String] = List(Alabama, California, Georgia, I, Maine, ...)
Or a custom pairwise merge:
def mer(s1: List[String], s2: List[String]): List[String] = {
  if (s1.isEmpty) return s2   // guard: the loop below assumes non-empty lists
  if (s2.isEmpty) return s1
  val s3 = List.newBuilder[String]
  var i1 = s1
  var i2 = s2
  while (true) {
    if (i2.head < i1.head) {
      s3 += i2.head
      i2 = i2.tail
      if (i2 eq Nil) {
        do {
          s3 += i1.head
          i1 = i1.tail
        } while (i1 ne Nil)
        return s3.result
      }
    }
    else {
      s3 += i1.head
      i1 = i1.tail
      if (i1 eq Nil) {
        do {
          s3 += i2.head
          i2 = i2.tail
        } while (i2 ne Nil)
        return s3.result
      }
    }
  }
  Nil // unreachable; keeps the type checker happy
}
followed by a divide-and-conquer strategy:
def mge(ss: Array[List[String]]): List[String] = {
  var n = ss.length
  val a = java.util.Arrays.copyOf(ss, ss.length)
  while (n > 1) {
    var i, j = 0
    while (i < n) {
      a(j) = if (i+1 < n) mer(a(i), a(i+1)) else a(i)
      i += 2
      j += 1
    }
    n = j
  }
  a(0)
}
and then you see:
scala> val ans = th.pbench{ mge(test) }
Benchmark (40940 calls in 141.1 ms)
Time: 2.806 us 95% CI 2.731 us - 2.882 us (n=19)
Garbage: 146.5 ns (n=1 sweeps measured)
ans: List[String] = List(Alabama, California, Georgia, I, Maine, ...)
So, there you go. For data of the size you indicated, and for using lists (which merge very cleanly), a good bet is indeed divide-and-conquer merges. (Heaps probably won't be any better and may be worse because of the additional complexity of maintaining a heap; heapsort tends to be slower than mergesort for the same reason.)
(Note: th.pbench is a call to my microbenchmarking utility, Thyme.)
Some additional suggestions involve an insertion sort:
def inst(xs: Array[String]): Array[String] = {
  var i = 1
  while (i < xs.length) {
    val x = xs(i)
    var j = i
    while (j > 0 && xs(j-1) > x) {
      xs(j) = xs(j-1)
      j -= 1
    }
    xs(j) = x
    i += 1
  }
  xs
}
But this is not competitive with the merge either, whether done in one sweep:
scala> val ans = th.pbench( inst(flat(test, 0, test.length)).toList )
Benchmark (20460 calls in 139.2 ms)
Time: 6.601 us 95% CI 6.414 us - 6.788 us (n=19)
Garbage: 293.0 ns (n=1 sweeps measured)
ans: List[String] = List(Alabama, California, Georgia, I, Maine, ...)
or two:
scala> th.pbench( mer(inst(flat(test, 0, test.length/2)).toList,
                      inst(flat(test, test.length/2, test.length)).toList) )
Benchmark (20460 calls in 119.3 ms)
Time: 5.407 us 95% CI 5.244 us - 5.570 us (n=20)
Garbage: 390.6 ns (n=1 sweeps measured)
res25: List[String] = List(Alabama, California, Georgia, I, Maine, ...
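Since the question also mentions a JavaScript target, the same pairwise-merge plus divide-and-conquer strategy might be sketched like this (the `merge2` and `mergeAll` names are illustrative, not from any library):

```javascript
// Pairwise merge of two sorted arrays: a plain two-pointer scan.
function merge2(xs, ys) {
  const out = new Array(xs.length + ys.length);
  let i = 0, j = 0, k = 0;
  while (i < xs.length && j < ys.length) {
    out[k++] = ys[j] < xs[i] ? ys[j++] : xs[i++];
  }
  while (i < xs.length) out[k++] = xs[i++];  // drain whichever side remains
  while (j < ys.length) out[k++] = ys[j++];
  return out;
}

// Divide-and-conquer: merge adjacent pairs level by level until one array
// remains, so each element passes through about log2(n) merges.
function mergeAll(lists) {
  if (lists.length === 0) return [];
  let level = lists.slice();
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(i + 1 < level.length ? merge2(level[i], level[i + 1]) : level[i]);
    }
    level = next;
  }
  return level[0];
}
```

As in the Scala version above, an odd list at the end of a level is carried over unmerged to the next level.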
Upvotes: 2
Reputation: 95318
Since your data set is so small, your asymptotic considerations are pretty much irrelevant (although correct), because the much more important factor will be micro-optimization. I would suggest you compare at least the following options, in increasing order of implementation effort:
O(k) in the number of descents. For 5 lists of size 10 each, the worst case for k is 1000, but in reality it will be much less. You could probably even prevent the worst case, for example by presorting the lists in order of increasing first element, or simply by randomizing the order.

These are general suggestions. Depending on your data set, there is likely a custom-tailored option that is even better. For example, use counting sort if the inputs are small enough numbers. Maybe use radix sort for larger numbers, although that's probably not going to buy you much over insertion sort. If you have any pattern in the input, exploit it.
Upvotes: 3