Tony Ding

Reputation: 11

Why is matrix multiplication with concurrency slower than without concurrency?

Part of my program imports a file that contains two matrices and multiplies them.

I'm confused about why the version with concurrency takes longer than the version without it.

Are there any bugs in my code?

// without concurrency
let mut result = vec![];

let time1 = Instant::now();
for i in 0..n {
    let mut temp_vector = vec![];
    for j in 0..n {
        let mut temp_num = 0;
        for multiple_count in 0..m {
            temp_num = temp_num + arr1[i][multiple_count] * arr2[multiple_count][j];
        }
        temp_vector.push(temp_num);
    }
    result.push(temp_vector);
}
let time2 = Instant::now();
println!("normal solving result:\n");

for i in 0..n {
    for j in 0..n {
        print!("{:?} ", result[i][j]);
    }
    println!();
}
let pass = time2.duration_since(time1);
println!("{:?}\n",pass);

println!("concurrency solving solution:\n");


// start the concurrency

let mut handles = vec![];

let arr1 = Arc::new(RwLock::new(arr1));
let arr2 = Arc::new(RwLock::new(arr2));

let count_time1 = Instant::now();
for i in 0..n {
    for j in 0..n {
        let arr1 = arr1.clone();
        let arr2 = arr2.clone();
        let handle = thread::spawn(move || {
            let mut count = 0;
            let arr1 = arr1.try_read().unwrap();
            let arr2 = arr2.try_read().unwrap();
            for k in 0..m {
                count = count + arr1[i][k] * arr2[k][j];
            }
            count
        });
        handles.push(handle);
    }
}
let count_time2 = Instant::now();
let pass_time = count_time2.duration_since(count_time1);

Upvotes: 1

Views: 594

Answers (2)

Linear

Reputation: 22196

There are several reasons, but the few that jump out:

  1. You're cloning two `Arc`s for every cell, and each clone is an atomic increment, which is slow. The clones happen outside the threads, but they still delay the START of each thread.
  2. You're taking a read-write lock (`RwLock`) in every thread. Even a read lock almost certainly involves at least one atomic write, and an atomic read of the lock's internal state.
  3. Spawning OS threads is not free or instantaneous; each spawn has a startup cost.
  4. Make sure you're using sufficiently sized data! You usually need at least n*n = 10000 elements to see anything noticeable from parallelism, and often several orders of magnitude more. This is part of why point 3 hurts here.
  5. Rust doesn't use lightweight coroutines; these are full OS threads. You'd be better off spawning only as many threads as you have logical cores (logical cores include hyperthreads; your OS reports them all as cores) and distributing the work evenly across them.

You could probably get a pretty significant speed-up by ditching the `RwLock` (you don't need it, since your data is read-only); the `Arc` only adds to startup time and to the time it takes a thread to join (since the thread needs to drop its `Arc`). By far your biggest speedup, however, will come from spawning only 4-8 threads, depending on your processor. I'll leave it to you how best to split the work into chunks, but it's fairly straightforward.

Edit: In fact, you can probably get rid of the `Arc` too, since the threads are joined immediately, but depending on Rust's thread-lifetime rules you may need the scoped-thread functionality (`crossbeam::scope`) from the crossbeam crate to actually make the borrows work.

As an aside, once you move to concurrent writes into the same data structure, I highly encourage you to read up on the processor cache, specifically false sharing. Mutexes are likely to be the bigger cost in Rust, but even if you manage to avoid them (e.g. by splitting a slice with `split_at_mut`), you can still thrash badly by constantly invalidating cache lines around the chunk boundaries.
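To make the advice above concrete, here is a minimal sketch of a lock-free parallel multiply: one thread per row chunk instead of one per cell, with no `Arc` and no `RwLock`. It assumes Rust 1.63+ for `std::thread::scope`; the function name `par_matmul` and the chunking scheme are illustrative, not the asker's code:

```rust
use std::thread;
use std::time::Instant;

// Parallel matrix multiply: one thread per row chunk instead of one per cell.
fn par_matmul(a: &[Vec<i64>], b: &[Vec<i64>]) -> Vec<Vec<i64>> {
    let n = a.len();    // rows of a
    let m = b.len();    // rows of b == cols of a
    let p = b[0].len(); // cols of b
    let workers = thread::available_parallelism().map(|w| w.get()).unwrap_or(4);
    let chunk = (n + workers - 1) / workers; // rows per thread, rounded up

    let mut result = vec![vec![0i64; p]; n];
    // Scoped threads may borrow `a`, `b`, and pieces of `result` directly:
    // no Arc, no RwLock, and all threads are joined when the scope ends.
    thread::scope(|s| {
        for (c, rows) in result.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (d, row) in rows.iter_mut().enumerate() {
                    let i = c * chunk + d; // absolute row index in the result
                    for j in 0..p {
                        let mut sum = 0;
                        for k in 0..m {
                            sum += a[i][k] * b[k][j];
                        }
                        row[j] = sum;
                    }
                }
            });
        }
    });
    result
}

fn main() {
    let a = vec![vec![1, 2], vec![3, 4]];
    let b = vec![vec![5, 6], vec![7, 8]];
    let t = Instant::now();
    let c = par_matmul(&a, &b);
    println!("{:?} in {:?}", c, t.elapsed());
}
```

Because each thread owns a disjoint `chunks_mut` slice of the result, no synchronization is needed for the writes either.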

Upvotes: 4

Lukas Kalbertodt

Reputation: 88526

Without looking into too much detail: you are spawning n² threads -- one for each cell of the result matrix. Spawning threads is expensive (note that Rust uses system threads by default, not "green threads").

Concurrency doesn't simply speed everything up; one has to be a bit smart about it. Usually you want to utilize all CPU cores, so you should spawn only roughly as many threads as there are cores. In your case, spawning a thread probably takes much longer than the work that thread does, hence the slowdown.
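Splitting the rows evenly across the available cores can be sketched like this; `chunk_ranges` is a hypothetical helper, and the core count comes from `std::thread::available_parallelism`:

```rust
use std::thread;

// Split n rows of work into half-open ranges, one per worker thread.
fn chunk_ranges(n: usize, workers: usize) -> Vec<(usize, usize)> {
    let chunk = (n + workers - 1) / workers; // ceiling division
    (0..workers)
        .map(|w| (w * chunk, ((w + 1) * chunk).min(n)))
        .filter(|(lo, hi)| lo < hi) // drop empty trailing ranges
        .collect()
}

fn main() {
    let workers = thread::available_parallelism().map(|w| w.get()).unwrap_or(4);
    for (lo, hi) in chunk_ranges(1000, workers) {
        println!("one thread handles rows {lo}..{hi}");
    }
}
```

Each range then becomes the workload of a single spawned thread, so the thread count stays at the core count no matter how large the matrix is.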

Upvotes: 4
