Reputation: 439
I'm writing some code for matrix multiplication in assembly language. I cannot use variables and only storage on the stack what i need. The algorithm seems working right, but i have problems with IMUL and MOV using registers in the last two blocks of code. I post my code here:
unsigned int m = 3; // raws of mat1
unsigned int n = 2; // columns of mat1
unsigned int k = 4; // columns of mat2
short int mat1[] = { -1,-2, 4,5, 4,-2 }; // first matrix
short int mat2[] = { 2,0,4,6, 0,2,-1,3 }; // second matrix
int mat3[1024]; // output matrix
__asm {
XOR EAX, EAX //mat1 raws counter
XOR EBX, EBX //mat2 columns counter
XOR EDX, EDX //mat1 columns(equal to mat2 raws) counter
XOR EDI, EDI //will contain sum of multiplications to be copied into output matrix
Loop1 : //determinates the raws of output matrix: mat3
XOR EBX, EBX //at the end of first raw, column counter is resetted
CMP m, EAX //if loopped mat1 m-raws times...
JZ Finale //...algortihm is over
INC EAX //increase mat1 raws counter
JMP Loop2
Loop2 : //determinates the columns of mat3
XOR EDX, EDX //at the end of the n-sums, mat1 column counter is resetted
XOR EDI, EDI //after sum of n-multiplications edi is resetted
CMP k, EBX //if multiplications/sums on this raw have been done...
JZ Loop1 //...go to next output matrix raw
INC EBX //increase mat2 columns counter
JMP Loop3
Loop3 : //determinates elements of mat3
CMP n, EDX //if the n-multiplacations/sums on first n-elements have been done...
JZ Loop2 //...skip to next n-elements
INC EDX //increase counter of the elements that will be multiplicate
JMP Stuffs //go to operations code block
Stuffs : //here code generates mat3 elements
#58 MOV SI, mat1[2 * ((EAX - 1) * 2 + (EDX - 1)] //moves to SI the [m-raws/n-clomumn] element of mat1
#59 IMUL SI, mat2[2 * ((EBX - 1) * 2 + (EDX - 1)] //multiplicates(with sign) SI and [n-raws/k-column] element of mat2
ADD DI, SI //sum the result in edi
CMP n, EDX //check the sums
JZ CopyResult //if n-sums have been done...
JMP Loop3 //...go to copy result into mat3
CopyResult :
#66 MOV mat3[4 * ((EAX - 1) * 4 + (EBX - 1))], EDI //copy value to output matrix mat3
JMP Loop3 //go to next n-elements
Finale :
}
{
unsigned int i, j, h;
printf("Output matrix:\n");
for (i = h = 0; i < m; i++) {
for (j = 0; j < k; j++, h++)
printf("%6d ", mat3[h]);
printf("\n");
}
}
In this code, compiler reports two types of error referring IMUL and MOV for mat1, mat2 and mat3. Here they are:
The same errors for lines 59 and 66, with EDX and EBX registers.
Is this algortihm basically good? (i tested setting manually some indexes, and then the last one, during debug and it was good but i couldn't totally test it).
I think that the first error depends on the second one, but if i cannot use this way the registers, what can i do to compute output?
Upvotes: 2
Views: 17841
Reputation: 439
Thanks everyone. This is my final code for triple nested loops and matrix product. For what i tested, it works with any matrix and positive/negative values:
void main()
{
unsigned int m = 3; // numero di righe della prima matrice
unsigned int n = 2; // numero di colonne della prima matrice
unsigned int k = 4; // numero di colonne della seconda matrice
short int mat1[] = { -1,-2, 4,5, 4,-2 }; // prima matrice
short int mat2[] = { 2,0,0,0, 0,2,0,0 }; // seconda matrice
int mat3[1024]; // matrice risultato
__asm {
lea eax, mat1 //load mat1
lea edi, mat3 //load mat result
push m
Loop3 : //extern loop
lea ebx, mat2 //load here mat2 to start from the beginning when new result raw starts
xor edx, edx //sets to zero column counter set to zero when new result row starts
Loop2 : //middle loop, as long as k, mat2 columns
xor ecx, ecx //sets to zero mat1 column counter every n multiplications
Loop1 : //inner loop
call Compute //calls sub program that calulates raw/column products
inc ecx //increase column counter
cmp ecx, n //check column counter
jb Loop1 //if below loop1 again
add ebx, 2 //if equal to n, inner loop is over, move mat2 position of one position
inc edx //increase mat2 column counter
cmp edx, k //chek mat2 column counter
jb Loop2 //if below loop2 again
imul esi, n, 2 //else calculate offset to skip to new raw in mat1
add eax, esi //...skip to new mat1 raw
imul esi, k, 4 //calculate offset to skip to new raw in result matrix(mat3)
add edi, esi //...skip to new raw in mat3
dec m //a raw in mat1 has been done, decrease its counter
cmp m, 0 //check how many raws has been done
ja Loop3 //if more than zero, do extern loop again
jmp Finale //else algorithm is over
Compute : //calulates raw/column products
movsx esi, WORD PTR[eax][ecx * 2]
push edi //stores mat3 address to free edi counter
push ecx //stores the actual value of column counter to free ecx register
mov edi, k //calculates the offset in mat2...
imul edi, ecx //...
movsx ecx, WORD PTR[ebx][edi * 2] //mov the value of mat2 to ecx
imul esi, ecx //multiplicates values of mat1 ad mat2
pop ecx //set back column counter
pop edi //set back mat3 address
cmp ecx, 0 //if ecx is zero...
je First //...is the first multiplication for this result value...
add[edi][edx * 4], esi //if not the first, add the value to current position
Back :
ret //in any case, comes back to loops...
First : //...so move here the first value to which add the others
mov[edi][edx * 4], esi //moving value
jmp Back
Finale : //the end
pop m //restore the original mat1 raw value to print the result matrix below
}
//Output on video
{
unsigned int i, j, h;
printf("Product Matrix:\n");
for (i = h = 0; i < m; i++) {
for (j = 0; j < k; j++, h++)
printf("%6d ", mat3[h]);
printf("\n");
}
}
}
And this is only the algorithm for triple nested loops i wrote starting from a double nested loops found here in stackoverflaw:
m = raws of first matrix
k = columns of second matrix
n = column of first matrix and raws of second matrix
x=0
Loop3:
y=0
Loop2:
z=0
Loop1:
compute...
z++
if(z<n)
go to Loop1
y++
if(y < k)
go to Loop2
x++
if(x < m)
go to Loop3
Else go to the End
Upvotes: 2
Reputation: 363950
Instead of trying to scale multiple registers by two in an addressing mode (which is impossible), just use add eax, 2
instead of inc eax
.
Also, since your output matrix uses 32-bit int
, you should be doing 32-bit math. You're generating a value in DI and then storing that plus whatever garbage is in the high half of EDI with line #66.
Something like
movsx esi, word ptr [rowstart + column]
/
movsx eax, word ptr [offset_in_column + row]
/
imul eax, esi
might work for (part of) the body of the inner loop. I'll let you sort out incrementing by columns in the first addressing-mode, and incrementing by rows in the 2nd addressing mode.
I think your algorithm is probably sane, based on what I think you're trying to do. For each element of the output matrix, loop over a column in one matrix and a row in the other matrix. So you only ever store once to each element of the output matrix. Whether your loops actually do that or not, IDK: It hurts my brain how ugly the branching is. (look at optimizing compiler output for a loop sometime, then for a double or triple nested loop. e.g. on http://gcc.godbolt.org/).
The other ways to nest the loops might be better or worse for performance with large matrices, but the only really good ways to do this (for large matrices) involve transposing one of the input matrices so you can loop over contiguous memory elements in both matrices at once (since transposing costs O(n^2) time, but speeds up the O(n^3) step which traverses the transposed array repeatedly, because it gives more cache hits).
(Given how common floating point matmul is in scientific computing, this is a topic that's been studied extensively, with much effort put into experimental tuning of code. See various implementations of the DGEMM function in BLAS.)
Upvotes: 4