stemm

Reputation: 6050

MapReduce matrix multiplication complexity

Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):

+---------------------------------+
|  i  |  j  |  value  |   matrix  |
+---------------------------------+
|  1  |  1  |   10    |     A     |
|  1  |  2  |   20    |     A     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
|  1  |  1  |    5    |     B     |
|  1  |  2  |    7    |     B     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
+---------------------------------+

And we want to calculate the product of these matrices: C = A x B
By definition: C_i_j = sum over k ( A_i_k * B_k_j )
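For example, for 2 x 2 matrices this expands to C_1_1 = A_1_1 * B_1_1 + A_1_2 * B_2_1; with the sample values above, the first partial product is 10 * 5 = 50, which is exactly the first 'C' line in the intermediate file shown below.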

And here is a two-step MapReduce algorithm for calculating this product (in pseudocode):

First step:

function Map (input is a single row of the file from above):

    i = row[0]
    j = row[1]
    value  = row[2]
    matrix = row[3]

    if(matrix == 'A')
        emit(j, {i, value, 'A'})
    else
        emit(i, {j, value, 'B'})

Both kinds of tuples are keyed by the shared index k (for A it is the column index j, for B it is the row index i), so that A_i_k and B_k_j end up in the same Reduce call. The complexity of this Map function is O(1).
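For concreteness, here is a minimal runnable sketch of this Map function in plain Python (no Hadoop; the function name map_step1 and the tuple layout of row are my own assumptions):

    def map_step1(row):
        # First-step mapper: key each cell by the shared index k,
        # so that A_i_k and B_k_j meet in the same Reduce call.
        i, j, value, matrix = row
        if matrix == 'A':
            yield (j, (i, value, 'A'))   # for A, k is the column index j
        else:
            yield (i, (j, value, 'B'))   # for B, k is the row index i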

function Reduce(Key, List of tuples from the Map function):

    Matrix_A_tuples = 
        filter( List of tuples from the Map function, where matrix == 'A' )

    Matrix_B_tuples = 
        filter( List of tuples from the Map function, where matrix == 'B' )

    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]

        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]

            emit({i, j}, {value_A * value_B, 'C'})

Complexity of this Reduce function is O(N^2) per key, since for N x N matrices each key k receives N tuples from A and N tuples from B.
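The same Reduce step as a Python sketch (again, the names and tuple layout are assumptions, matching the mapper above):

    def reduce_step1(key, tuples):
        # First-step reducer: for a fixed k, pair every A-tuple with
        # every B-tuple and emit the partial product A_i_k * B_k_j,
        # keyed by the output cell (i, j).
        a_tuples = [t for t in tuples if t[2] == 'A']
        b_tuples = [t for t in tuples if t[2] == 'B']
        for i, value_a, _ in a_tuples:
            for j, value_b, _ in b_tuples:
                yield ((i, j), (value_a * value_b, 'C'))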

After the first step we will get something like the following file (which contains O(N^3) lines, since each of the N^2 cells of C receives N partial products):

+---------------------------------+
|  i  |  j  |  value  |   matrix  |
+---------------------------------+
|  1  |  1  |   50    |     C     |
|  1  |  1  |   45    |     C     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
|  2  |  2  |    70   |     C     |
|  2  |  2  |    17   |     C     |
|     |     |         |           |
| ... | ... |   ...   |    ...    |
|     |     |         |           |
+---------------------------------+

So all we have to do is sum the values from the lines that contain the same i and j.

Second step:

function Map (input is a single row of the file produced in the first step):
    i = row[0]
    j = row[1]
    value = row[2]
    emit({i, j}, value)


function Reduce(Key, List of values from the Map function):

    i = Key[0]
    j = Key[1]

    result = 0

    for each Value from List of values from the Map function
        result += Value

    emit({i, j}, result)
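And here is a sketch of the second step, together with a toy in-process "shuffle" so the whole pipeline can be tried locally (run_job is a hypothetical helper, not a real MapReduce API):

    from collections import defaultdict

    def map_step2(row):
        # Second-step mapper: identity, keyed by the output cell (i, j).
        i, j, value = row
        yield ((i, j), value)

    def reduce_step2(key, values):
        # Second-step reducer: sum the partial products of one cell of C.
        yield (key, sum(values))

    def run_job(records, mapper, reducer):
        # Group mapper output by key (the "shuffle"), then reduce each group.
        groups = defaultdict(list)
        for record in records:
            for key, value in mapper(record):
                groups[key].append(value)
        return [out for key, values in groups.items()
                    for out in reducer(key, values)]

    # Usage: cells is a list of (i, j, value, matrix) rows from the input file.
    # step1 = run_job(cells, map_step1, reduce_step1)
    # rows  = [(i, j, v) for (i, j), (v, _) in step1]
    # C     = run_job(rows, map_step2, reduce_step2)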


After the second step we will get a file which contains the cells of the matrix C.

So the question is:

Taking into account that there are multiple instances in the MapReduce cluster, what is the most correct way to estimate the complexity of the provided algorithm?

The first estimate that comes to mind is this:
Assume that the number of instances in the MapReduce cluster is K. Then, because the file produced after the first step contains O(N^3) lines, the overall complexity can be estimated as O(N^3 / K).
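To make that concrete (the numbers are only an illustration): for N = 1000 and K = 100 instances, the first step emits N^3 = 10^9 partial products, so under a perfectly even distribution each instance processes roughly 10^9 / 100 = 10^7 of them.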

But this estimation doesn't take into account many details, such as the network bandwidth between instances of the MapReduce cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.

So, I would like to know: what is the best approach for estimating the efficiency of the provided MapReduce algorithm, and does it make sense to use Big-O notation to estimate the efficiency of MapReduce algorithms at all?

Upvotes: 4

Views: 1726

Answers (1)

Dhoha

Reputation: 369

As you said, Big-O estimates the computational complexity and does not take into consideration networking issues such as bandwidth, congestion, and delay.

If you want to measure how efficient the communication between instances is, you need other networking metrics.

However, note that if your file is not big enough, you will not see an improvement in execution speed, because MapReduce works efficiently only with big data. Moreover, your code has two steps, which means two jobs. Between one job and the next, MapReduce takes time to upload the intermediate file and start the next job, which can slightly affect performance.

I think you can evaluate efficiency in terms of speed and time, as the MapReduce approach is certainly faster than a sequential algorithm when it comes to big data.

Moreover, efficiency can also be considered with regard to fault tolerance: MapReduce handles failures by itself, so programmers do not need to deal with instance or network failures.

Upvotes: 2
