Victor Moroz
Victor Moroz

Reputation: 9225

Sorting arrays in Scala: performance issues

val a = new Array[(Int, Int)](250000000)
... // initialization here

// #1
val b = a.sortBy(_._1)
// This part completely kills GC, 
// and I allocate plenty of heap memory (30GB)
// more than it's trying to use

// #2
val b = a.sorted
// Kills GC as efficiently as #1

// #3
val b = a.sortWith(_._1 < _._1)
// This part doesn't kill GC, but
// takes long to complete (224s)
// and still requires 10GB of heap

// #4
val a = new Array[Long](250000000)
java.util.Arrays.sort(a)
// Alternative version
// which takes only 2.7GB
// and completes in 40 secs

By killing GC I mean it's working like crazy using all 16 cores (I know I can reduce # of cores for GC, but it doesn't solve the problem)

Okay, I understand there is an overhead for #3 (objects, immutable operation), not sure it should take 5x amount of time and 4x amount of heap, but it's the price I guess. But I'm puzzled with #1 and #2. Should I avoid implicit ordering as a plague? What's happening there? I'm probably doing it wrong?

Scala version: 2.12.4

Upvotes: 2

Views: 315

Answers (1)

Jacek L.
Jacek L.

Reputation: 1416

Those methods requires it to create a copy of that array with the mapped values, then apply sorting with a custom comparator. Arrays.sort over integral values is much smarter - it can compare number directly and do not have to invoke any method to compare them.

Please find the implementation of those methods:

def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
  val len = this.length
  val b = newBuilder
  if (len == 1) b ++= this
  else if (len > 1) {
    b.sizeHint(len)
    val arr = new Array[AnyRef](len)  // Previously used ArraySeq for more compact but slower code
    var i = 0
    for (x <- this) {
      arr(i) = x.asInstanceOf[AnyRef]
      i += 1
    }
    java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
    i = 0
    while (i < arr.length) {
      b += arr(i).asInstanceOf[A]
      i += 1
    }
  }
  b.result()
}

I also think that when you have pairs of longs, you basically have memory footprint for a pair - i don't know if you also have a footprint for each long - consider using @specialized annotation to use real longs: http://www.scala-lang.org/api/2.12.3/scala/specialized.html

Upvotes: 1

Related Questions