Aliquis
Aliquis

Reputation: 2309

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.

I read a blog post about this topic and there you can find this code snippets:

List implementation:

val list = generateSequence(1) { it + 1 }
    .take(50_000_000)
    .toList()

measure {
    list
        .filter { it % 3 == 0 }
        .average()
}
// 8644 ms

Sequence implementation:

val sequence = generateSequence(1) { it + 1 }
    .take(50_000_000)

measure {
    sequence
        .filter { it % 3 == 0 }
        .average()
}
// 822 ms

The point here is that the Sequence implementation is about 10x faster.

However, I do not really understand WHY that is. I know that with a Sequence, you do "lazy evaluation", but I cannot find any reason why that helps reducing the processing in this example.

However, here I know why a Sequence is generally faster:

val result = sequenceOf("a", "b", "c")
    .map {
        println("map: $it")
        it.toUpperCase()
    }
    .any {
        println("any: $it")
        it.startsWith("B")
    }

Because with a Sequence you process the data "vertically", when the first element starts with "B", you don't have to map for the rest of the elements. It makes sense here.

So, why is it also faster in the first example?

Upvotes: 8

Views: 1740

Answers (2)

gidds
gidds

Reputation: 18577

Let's look at what those two implementations are actually doing:

  1. The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.

    (In fact, it's probably far more than that.  As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them — storing references to Int objects.  On the JVM, each reference could be 8 or 16 bytes, and each Int could take 16, giving 1–2GB.  Also, depending how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)

    Then it has to read all the values back from the list, filter them, and create another list in memory.

    Finally, it has to read all those values back in again, to calculate the average.

  2. The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.

    (That's pretty much how you'd do it if you were implementing it ‘by hand’.)

You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!

Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).

In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!

Upvotes: 13

Grzegorz Piwowarek
Grzegorz Piwowarek

Reputation: 13793

Leveraging lazy-evaluation allows avoiding the creation of intermediate objects that are irrelevant from the point of the end goal.

Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.

Initial code produces a list containing 50_000_000 objects:

 val list = generateSequence(1) { it + 1 }
  .take(50_000_000)
  .toList()

then iterates through it and creates another list containing a subset of its elements:

.filter { it % 3 == 0 }

... and then proceeds with calculating the average:

.average()

Using sequences allows you to avoid doing all those intermediate steps. The below code doesn't produce 50_000_000 elements, it's just a representation of that 1...50_000_000 sequence:

val sequence = generateSequence(1) { it + 1 }
  .take(50_000_000)

adding a filtering to it doesn't trigger the calculation itself as well but derives a new sequence from the existing one (3, 6, 9...):

.filter { it % 3 == 0 }

and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:

.average()

Some relevant reading:

Kotlin: Beware of Java Stream API Habits

Kotlin Collections API Performance Antipatterns

Upvotes: 5

Related Questions