stemm

Reputation: 6050

MapReduce matrix multiplication complexity

Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):

+---------------------------------+
|  i  |  j  |  value  |   matrix  |
+---------------------------------+
|  1  |  1  |   10    |     A     |
|  1  |  2  |   20    |     A     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
|  1  |  1  |    5    |     B     |
|  1  |  2  |    7    |     B     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
+---------------------------------+

And we want to calculate the product of these matrices: C = A x B
By definition: C_i_j = sum over k ( A_i_k * B_k_j )
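For example, for 2 x 2 matrices this expands to C_1_1 = A_1_1 * B_1_1 + A_1_2 * B_2_1; with the sample values above, the first partial product is 10 * 5 = 50, which is exactly the first 'C' line in the intermediate file shown below.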

And here is a two-step MapReduce algorithm for calculating this product (in pseudocode):

First step:

function Map (input is a single row of the file from above):

    i = row[0]
    j = row[1]
    value  = row[2]
    matrix = row[3]

    if(matrix == 'A')
        emit(j, {i, value, 'A'})
    else
        emit(i, {j, value, 'B'})

Both kinds of tuples are keyed by the shared index k (for A it is the column index j, for B it is the row index i), so that A_i_k and B_k_j end up in the same Reduce call. The complexity of this Map function is O(1).
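For concreteness, here is a minimal runnable sketch of this Map function in plain Python (no Hadoop; the function name map_step1 and the tuple layout of row are my own assumptions):

    def map_step1(row):
        # First-step mapper: key each cell by the shared index k,
        # so that A_i_k and B_k_j meet in the same Reduce call.
        i, j, value, matrix = row
        if matrix == 'A':
            yield (j, (i, value, 'A'))   # for A, k is the column index j
        else:
            yield (i, (j, value, 'B'))   # for B, k is the row index i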

function Reduce(Key, List of tuples from the Map function):

    Matrix_A_tuples = 
        filter( List of tuples from the Map function, where matrix == 'A' )

    Matrix_B_tuples = 
        filter( List of tuples from the Map function, where matrix == 'B' )

    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]

        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]

            emit({i, j}, {value_A * value_B, 'C'})

Complexity of this Reduce function is O(N^2) per key, since for N x N matrices each key k receives N tuples from A and N tuples from B.
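The same Reduce step as a Python sketch (again, the names and tuple layout are assumptions, matching the mapper above):

    def reduce_step1(key, tuples):
        # First-step reducer: for a fixed k, pair every A-tuple with
        # every B-tuple and emit the partial product A_i_k * B_k_j,
        # keyed by the output cell (i, j).
        a_tuples = [t for t in tuples if t[2] == 'A']
        b_tuples = [t for t in tuples if t[2] == 'B']
        for i, value_a, _ in a_tuples:
            for j, value_b, _ in b_tuples:
                yield ((i, j), (value_a * value_b, 'C'))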

After the first step we will get something like the following file (which contains O(N^3) lines, since each of the N^2 cells of C receives N partial products):

+---------------------------------+
|  i  |  j  |  value  |   matrix  |
+---------------------------------+
|  1  |  1  |   50    |     C     |
|  1  |  1  |   45    |     C     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
|  2  |  2  |    70   |     C     |
|  2  |  2  |    17   |     C     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
+---------------------------------+

So all we have to do is sum the values from the lines that contain the same i and j.

Second step:

function Map (input is a single row of the file produced in the first step):
    i = row[0]
    j = row[1]
    value = row[2]
    emit({i, j}, value)


function Reduce(Key, List of values from the Map function):

    i = Key[0]
    j = Key[1]

    result = 0

    for each Value from List of values from the Map function
        result += Value

    emit({i, j}, result)
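And here is a sketch of the second step, together with a toy in-process "shuffle" so the whole pipeline can be tried locally (run_job is a hypothetical helper, not a real MapReduce API):

    from collections import defaultdict

    def map_step2(row):
        # Second-step mapper: identity, keyed by the output cell (i, j).
        i, j, value = row
        yield ((i, j), value)

    def reduce_step2(key, values):
        # Second-step reducer: sum the partial products of one cell of C.
        yield (key, sum(values))

    def run_job(records, mapper, reducer):
        # Group mapper output by key (the "shuffle"), then reduce each group.
        groups = defaultdict(list)
        for record in records:
            for key, value in mapper(record):
                groups[key].append(value)
        return [out for key, values in groups.items()
                    for out in reducer(key, values)]

    # Usage: cells is a list of (i, j, value, matrix) rows from the input file.
    # step1 = run_job(cells, map_step1, reduce_step1)
    # rows  = [(i, j, v) for (i, j), (v, _) in step1]
    # C     = run_job(rows, map_step2, reduce_step2)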


After the second step we will get a file which contains the cells of the matrix C.

So the question is:

Taking into account that there are multiple instances in the MapReduce cluster, what is the most correct way to estimate the complexity of the provided algorithm?

The first estimate that comes to mind is this:
Assume that the number of instances in the MapReduce cluster is K. Then, because the file produced after the first step contains O(N^3) lines, the overall complexity can be estimated as O(N^3 / K).
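To make that concrete (the numbers are only an illustration): for N = 1000 and K = 100 instances, the first step emits N^3 = 10^9 partial products, so under a perfectly even distribution each instance processes roughly 10^9 / 100 = 10^7 of them.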

But this estimation doesn't take into account many details, such as the network bandwidth between instances of the MapReduce cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.

So, I would like to know: what is the best approach for estimating the efficiency of the provided MapReduce algorithm, and does it make sense to use Big-O notation to estimate the efficiency of MapReduce algorithms at all?

Upvotes: 4

Views: 1726

Answers (1)

Dhoha

Reputation: 369

As you said, Big-O estimates the computational complexity and does not take into consideration networking issues such as bandwidth, congestion, and delay.

If you want to measure how efficient the communication between instances is, you need other networking metrics.

However, note that if your file is not big enough, you will not see an improvement in execution speed, because MapReduce works efficiently only with big data. Moreover, your code has two steps, which means two jobs. Between one job and the next, MapReduce takes time to upload the intermediate file and start the next job, which can slightly affect performance.

I think you can evaluate efficiency in terms of speed and time, as the MapReduce approach is certainly faster than a sequential algorithm when it comes to big data.

Moreover, efficiency can also be considered with regard to fault tolerance: MapReduce handles failures by itself, so programmers do not need to deal with instance or network failures.

Upvotes: 2
