Reputation: 1719
I have a unit test setup to prove that concurrently performing multiple heavy tasks is faster than serial.
Now... before everyone in here loses their minds over the fact that the above statement is not always correct because multithreading comes with many uncertainties, let me explain.
I know from reading the Apple documentation that you cannot guarantee you get multiple threads when asking for them. The OS (iOS) will assign threads however it sees fit. If the device has only one core, for example, it will assign that one core, and serial will be slightly faster, because the initialisation code of the concurrent operation takes some extra time while delivering no performance improvement on a single core.
However: this difference should only be slight. But in my POC setup the difference is massive: concurrent is slower by about a third.
If serial completes in 6 seconds, concurrent will complete in 9 seconds.
This trend continues even with heavier loads: if serial completes in 125 seconds, concurrent will complete in 215 seconds. And this happens not just once, but consistently, every time.
I wonder if I made a mistake in creating this POC, and if so, how should I prove that concurrently performing multiple heavy tasks is indeed faster than serial?
My POC in swift unit tests:
func performHeavyTask(_ completion: (() -> Void)?) {
    var counter = 0
    while counter < 50000 {
        print(counter)
        counter = counter.advanced(by: 1)
    }
    completion?()
}
// MARK: - Serial

func testSerial() {
    let start = DispatchTime.now()
    let _ = DispatchQueue.global(qos: .userInitiated)
    let mainDPG = DispatchGroup()
    mainDPG.enter()
    DispatchQueue.global(qos: .userInitiated).async { [weak self] in
        guard let self = self else { return }
        for _ in 0...10 {
            self.performHeavyTask(nil)
        }
        mainDPG.leave()
    }
    mainDPG.wait()
    let end = DispatchTime.now()
    let nanoTime = end.uptimeNanoseconds - start.uptimeNanoseconds // <<<<< Difference in nano seconds (UInt64)
    print("NanoTime: \(nanoTime / 1_000_000_000)")
}
// MARK: - Concurrent

func testConcurrent() {
    let start = DispatchTime.now()
    let _ = DispatchQueue.global(qos: .userInitiated)
    let mainDPG = DispatchGroup()
    mainDPG.enter()
    DispatchQueue.global(qos: .userInitiated).async {
        let dispatchGroup = DispatchGroup()
        let _ = DispatchQueue.global(qos: .userInitiated)
        DispatchQueue.concurrentPerform(iterations: 10) { index in
            dispatchGroup.enter()
            self.performHeavyTask({
                dispatchGroup.leave()
            })
        }
        dispatchGroup.wait()
        mainDPG.leave()
    }
    mainDPG.wait()
    let end = DispatchTime.now()
    let nanoTime = end.uptimeNanoseconds - start.uptimeNanoseconds // <<<<< Difference in nano seconds (UInt64)
    print("NanoTime: \(nanoTime / 1_000_000_000)")
}
Details:
OS: macOS High Sierra
Model Name: MacBook Pro
Model Identifier: MacBookPro11,4
Processor Name: Intel Core i7
Processor Speed: 2.2 GHz
Number of Processors: 1
Total Number of Cores: 4
Both tests were run on an iPhone XS Max simulator, straight after a reboot of the entire Mac (to avoid the Mac being busy with applications other than this unit test, which would blur the results).
Also, both unit tests are wrapped in an async DispatchWorkItem so that the main (UI) queue is not blocked. This prevents the serial test case from gaining an advantage there, as it would otherwise consume the main queue while the concurrent test case uses a background queue.
I'll also accept an answer that shows a POC which reliably tests this. It does not have to show that concurrent is faster than serial all of the time (see the explanation above as to why not), but at least some of the time.
Upvotes: 4
Views: 1759
Reputation: 437552
There are two issues:
I’d avoid doing print inside the loop. That’s synchronized, so you’re likely to experience greater performance degradation in the concurrent implementation. That’s not the whole story here, but it doesn’t help.
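For illustration, a minimal sketch of the asker's helper with the per-iteration print removed (the counter is printed once, after the hot loop):

```swift
// Sketch: same busy loop as the question's helper, but without a print
// on every iteration (stdout is synchronized, which serializes threads).
func performHeavyTask(_ completion: (() -> Void)?) {
    var counter = 0
    while counter < 50_000 {
        counter += 1
    }
    print(counter) // a single print after the loop is harmless
    completion?()
}

performHeavyTask(nil) // prints 50000
```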
Even after removing the print from within the loop, 50,000 increments of the counter is simply not enough work to see the benefit of concurrentPerform. As Improving on Loop Code says:
... And although this [concurrentPerform] can be a good way to improve performance in loop-based code, you must still use this technique discerningly. Although dispatch queues have very low overhead, there are still costs to scheduling each loop iteration on a thread. Therefore, you should make sure your loop code does enough work to warrant the costs. Exactly how much work you need to do is something you have to measure using the performance tools.
On debug build, I needed to increase number of iterations to values closer to 5,000,000 before this overhead was overcome. And on release build, even that wasn’t sufficient. A spinning loop and incrementing a counter is just too quick to offer meaningful analysis of concurrent behavior.
So, in my example below, I replaced this spinning loop with a more computationally intensive calculation (calculating π using a historic, but not terribly efficient, algorithm).
As an aside:
Rather than measuring the performance yourself, if you do this within an XCTestCase unit test, you can use measure to benchmark performance. This repeats the benchmarking multiple times, captures elapsed time, averages the results, etc. Just make sure to edit your scheme so the test action uses an optimized “release” build rather than a “debug” build.
There’s no point in dispatching this to a global queue if you’re going to use a dispatch group to make the calling thread wait for it to complete.
You don’t need to use dispatch groups to wait for concurrentPerform to finish, either. It runs synchronously. As the concurrentPerform documentation says:
The dispatch queue executes the submitted block the specified number of times and waits for all iterations to complete before returning.
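That synchronous behavior can be seen in a small sketch (the buffer-pointer dance is just to avoid concurrently mutating the Array value itself; each iteration writes a distinct index):

```swift
import Dispatch

var squares = [Int](repeating: 0, count: 4)

// concurrentPerform blocks the calling thread until every iteration
// has finished, so no DispatchGroup (or wait) is needed afterwards.
squares.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: 4) { i in
        buffer[i] = i * i // each iteration writes its own slot
    }
}

print(squares) // [0, 1, 4, 9] — fully populated by the time we get here
```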
It’s not really material, but it’s worth noting that your for _ in 0...10 { ... } is doing 11 iterations, not 10. You obviously meant to use ..< .
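A quick check of the two range operators:

```swift
// ... is a closed range (includes the upper bound); ..< is half-open.
let closed = Array(0...10)
let halfOpen = Array(0..<10)
print(closed.count)   // 11
print(halfOpen.count) // 10
```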
Thus, here is an example, putting it in a unit test, but replacing the “heavy” calculation with something more computationally intensive:
class MyAppTests: XCTestCase {
    // calculate pi using Gregory-Leibniz series
    func calculatePi(iterations: Int) -> Double {
        var result = 0.0
        var sign = 1.0
        for i in 0 ..< iterations {
            result += sign / Double(i * 2 + 1)
            sign *= -1
        }
        return result * 4
    }

    func performHeavyTask(iteration: Int) {
        let pi = calculatePi(iterations: 100_000_000)
        print(iteration, .pi - pi)
    }

    func testSerial() {
        measure {
            for i in 0..<10 {
                self.performHeavyTask(iteration: i)
            }
        }
    }

    func testConcurrent() {
        measure {
            DispatchQueue.concurrentPerform(iterations: 10) { i in
                self.performHeavyTask(iteration: i)
            }
        }
    }
}
On my MacBook Pro 2018 with 2.9 GHz Intel Core i9, with a release build the concurrent test took, on average, 0.247 seconds, whereas the serial test took roughly four times as long, 1.030 seconds.
Upvotes: 9