Mike

Reputation: 45

How to efficiently modify blocks of a Rust array in parallel using rayon?

I'm trying to build scientific computing software with Rust, which requires manipulating matrices during operation. A typical operation is appending sub-blocks of small matrices to specific positions of a large matrix. Let me demonstrate the process with the following example:

Suppose I have a matrix M_{master} with 8 rows and 8 columns, and all elements are 0 in the initial state:

[figure: M_{master}, an 8×8 matrix of zeros]

There are also two 6-row 6-column matrices M_{slave1} and M_{slave2}:

[figures: M_{slave1} and M_{slave2}, two 6×6 matrices]

Step 1) Write the four sub-blocks of M_{slave1} into M_{master}, which gives M_{master1}:

[figure: M_{master1}, the result of step 1]

Step 2) Write the four sub-blocks of M_{slave2} into M_{master1}, which gives M_{master12}:

[figure: M_{master12}, the result of step 2]

The green elements in M_{master12} are located at positions where both M_{slave1} and M_{slave2} need to be written, so those entries receive contributions from both slave matrices.

For the problem described above, I have implemented a single-threaded solution using nested for loops. Here are my code and the results (you can run the single-threaded version directly in the Rust Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=cc5db512a4445ae3a92b1668e0611eba):

use rayon::prelude::*;
use std::time::Instant;

fn main() {
    const R: usize = 8;
    const C: usize = 8;

    const R_SUB1: usize = 6;
    const C_SUB1: usize = 6;

    const R_SUB2: usize = 6;
    const C_SUB2: usize = 6;

    let mut arr1: [[u8; C]; R] = [[0; C]; R];
    for row in 1..=R {
        for col in 1..=C {
            arr1[row - 1][col - 1] = 0;
        }
    }
    arr2d_print(arr1);

    let sub1: [[u8; C_SUB1]; R_SUB1] = [[9; C_SUB1]; R_SUB1];
    arr2d_print(sub1);
    let cpld1: Vec<usize> = vec![0, 1, 3]; // Block indices that determine where the sub-blocks are written (not direct element positions)

    let sub2: [[u8; C_SUB2]; R_SUB2] = [[90; C_SUB2]; R_SUB2];
    arr2d_print(sub2);
    let cpld2: Vec<usize> = vec![0, 2, 3];

    let subs = vec![sub1, sub2];
    let cplds = vec![cpld1, cpld2];

    let t1 = Instant::now();
    cplds
        .iter()
        .zip(subs.iter())
        .for_each(|(cpld, sub)| sqare_matrix_block_fill(&mut arr1, sub, cpld));
    let tot1 = t1.elapsed();
    arr2d_print(arr1);
    println!("The single-threaded version takes time: {:?}\n", tot1);
}

fn sqare_matrix_block_fill<const D1: usize, const D2: usize>(
    master: &mut [[u8; D1]; D1],
    slave: &[[u8; D2]; D2],
    cplds: &[usize],
) {
    if D2 != 2 * cplds.len() {
        panic!("!!! Wrong !!!");
    }

    for idx in 0..cplds.len() {
        for idy in 0..cplds.len() {
            for row in 0..2 {
                for col in 0..2 {
                    master[cplds[idx] * 2 + row][cplds[idy] * 2 + col] +=
                        slave[idx * 2 + row][idy * 2 + col];
                }
            }
        }
    }
}

fn arr2d_print<const D: usize>(arr: [[u8; D]; D]) {
    for i in 0..D {
        for j in 0..D {
            print!("{:4} ", arr[i][j]);
        }
        println!();
    }
    println!();
}

The result of running the above code:

Standard Output
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 

   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 

  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 

  99   99    9    9   90   90   99   99 
  99   99    9    9   90   90   99   99 
   9    9    9    9    0    0    9    9 
   9    9    9    9    0    0    9    9 
  90   90    0    0   90   90   90   90 
  90   90    0    0   90   90   90   90 
  99   99    9    9   90   90   99   99 
  99   99    9    9   90   90   99   99 

The single-threaded version takes time: 4.561µs
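To clarify the cpld parameters: each entry selects a 2×2 block row/column of the master matrix, so the slave matrix is treated as a grid of 2×2 blocks (hence the check D2 == 2 * cplds.len()), and block (idx, idy) of the slave is added to master rows cplds[idx]*2 .. cplds[idx]*2 + 2 and columns cplds[idy]*2 .. cplds[idy]*2 + 2. For example, with cpld1 = [0, 1, 3], the bottom-right 2×2 block of M_{slave1} (idx = idy = 2) is added to master rows 6..8 and columns 6..8, while master rows 4 and 5 receive nothing from M_{slave1}.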

Now my question is: how can I build a parallel version of the sqare_matrix_block_fill function using Rayon?

I know that there will inevitably be write conflicts at the locations of the large matrix that are written more than once, but hey, this is exactly why I came to Stack Overflow, sincerely begging for help!
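For reference, the naive translation I would try first, simply swapping iter() for par_iter() (rayon's prelude is already imported above), is rejected by the compiler, because for_each requires a Fn closure while every call would need its own mutable borrow of arr1:

    // Naive parallel attempt: does NOT compile, because `for_each` requires
    // a `Fn` closure, but filling the master matrix needs `&mut arr1` on
    // every call of the closure.
    cplds
        .par_iter()
        .zip(subs.par_iter())
        .for_each(|(cpld, sub)| sqare_matrix_block_fill(&mut arr1, sub, cpld));

So what I'm really looking for is an idiomatic Rayon pattern for this kind of overlapping block accumulation, for example per-thread copies of the master matrix that are summed at the end, or splitting the master matrix into disjoint mutable regions.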

Upvotes: 1

Views: 48

Answers (0)
