Reputation: 25
I want some suggestions to optimize my code. It is simple, but it needs to be fast, and by fast I mean less than 250 ns.
My first version was slow, about 1000 ns, but after some work it is down to about 550 ns.
I believe it can be made faster, but I don't know how :<
I am using a PIC32 with an 80 MHz system clock.
My code:
void main()
{
    unsigned long int arr_1[4095];
    unsigned long int arr_2[4095];
    //here I assign arr_1 and arr_2 values
    //...
    //...
    TRISC = 0;
    TRISD = 0;
    while(1){
        LATC = arr_1[PORTE];
        LATD = arr_2[PORTE];
    }
}
As you can see, the job is very simple; the only problem is the speed.
I looked at the assembly listing to see how many instructions there are, but I don't know assembly language well enough to optimize it.
;main.c, 14 :: LATC = arr_1[PORTE];
0x9D000064 0x27A30000 ADDIU R3, SP, 0
0x9D000068 0x3C1EBF88 LUI R30, 49032
0x9D00006C 0x8FC26110 LW R2, 24848(R30)
0x9D000070 0x00021080 SLL R2, R2, 2
0x9D000074 0x00621021 ADDU R2, R3, R2
0x9D000078 0x8C420000 LW R2, 0(R2)
0x9D00007C 0x3C1EBF88 LUI R30, 49032
0x9D000080 0xAFC260A0 SW R2, 24736(R30)
;main.c, 15 :: LATD = arr_2[PORTE];
0x9D000084 0x27A33FFC ADDIU R3, SP, 16380
0x9D000088 0x3C1EBF88 LUI R30, 49032
0x9D00008C 0x8FC26110 LW R2, 24848(R30)
0x9D000090 0x00021080 SLL R2, R2, 2
0x9D000094 0x00621021 ADDU R2, R3, R2
0x9D000098 0x8C420000 LW R2, 0(R2)
0x9D00009C 0x3C1EBF88 LUI R30, 49032
;main.c, 16 :: }
0x9D0000A0 0x0B400019 J L_main0
0x9D0000A4 0xAFC260E0 SW R2, 24800(R30)
Any suggestions for optimizing my code?
Edit:
* PORTE, LATC and LATD are I/O mapped registers
* The goal of the code is to change the LATC and LATD registers as fast as possible when PORTE changes (so PORTE is an input and LATC and LATD are outputs); the output depends on the value of PORTE
Upvotes: 0
Views: 339
Reputation: 93564
A potential limiting factor is that, since PORTE, LATC and LATD are not regular memory but rather I/O registers, it is possible that the I/O bus speed is lower than the memory bus speed and that the processor inserts wait-states between accesses. That may or may not be the case for PIC32, but it is a general point that you need to consider for any architecture.
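To judge whether wait-states could be a factor, it helps to convert the timings in the question into clock cycles. The following is a minimal sketch, assuming one core cycle per period of the 80 MHz system clock; the helper name ns_to_cycles is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: converts a duration in nanoseconds into whole
   clock cycles for a clock frequency given in MHz (rounds down).
   Assumes one cycle per clock period. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_mhz)
{
    /* cycles = ns * MHz / 1000; 64-bit intermediate avoids overflow. */
    return (uint32_t)(((uint64_t)ns * clock_mhz) / 1000u);
}
```

Under that assumption the 250 ns target is 20 cycles at 80 MHz and the measured 550 ns is 44 cycles, while the listing in the question shows 17 instructions per iteration (including the jump and its delay slot). If the core nominally issues about one instruction per cycle, a fair share of the time is probably spent waiting on memory or I/O accesses rather than executing instructions.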
If the I/O bus is not a limitation, then first of all: have you applied compiler optimisations? For such micro-optimisations that is usually your best bet. This code seems trivially optimisable, but the assembly listing does not appear to reflect that (although I am no MIPS assembly expert; the compiler's optimiser, however, is).
Since I/O registers are volatile, the optimiser may be prevented from optimising the loop body significantly. But since they are volatile, the code is probably also unsafe, since it is possible (and indeed likely) for PORTE to change value between the assignment of LATC and LATD, which may not be your intention or desirable. If that is the case then the code should be changed as follows:
int porte_value_latch = 0 ;
for(;;)
{
    // Get a non-volatile copy of PORTE.
    porte_value_latch = PORTE ;
    // Write LATC/D with a consistent PORTE value that
    // won't change between assignments, and does not need
    // to be read from memory or I/O.
    LATC = arr_1[porte_value_latch] ;
    LATD = arr_2[porte_value_latch] ;
}
which is then both safe and potentially faster, since the volatile PORTE is only read once and the porte_value_latch value can be retained in a temporary register for both array accesses rather than read from memory each time. The optimiser will almost certainly optimise it to a register access even if regular compilation does not.
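The single-read idea can also be written against pointers, so that one loop iteration can be exercised on a host PC before it goes onto the target. This is only a sketch: the function name update_outputs and its parameters are illustrative, and on the PIC32 the three register pointers would have to refer to the real PORTE, LATC and LATD.

```c
#include <stdint.h>

/* One iteration of the loop. All names are hypothetical; the caller
   must guarantee that the value read from *porte is a valid index
   into both tables. */
void update_outputs(const volatile uint32_t *porte,
                    volatile uint32_t *latc,
                    volatile uint32_t *latd,
                    const uint32_t *table_c,
                    const uint32_t *table_d)
{
    /* Single volatile read; the copy is an ordinary local value. */
    const uint32_t index = *porte;

    /* Both outputs are derived from the same input sample. */
    *latc = table_c[index];
    *latd = table_d[index];
}
```

Passing the table addresses as parameters also gives the compiler the chance to keep them in registers across iterations, rather than recomputing them for every access as the listing in the question does.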
The use of for(;;) rather than while(1) probably makes little difference, but some compilers issue a warning for invariant while expressions yet accept the for(;;) idiom quietly. You have not included the assembly listing for line 13, so it is not possible to determine what your compiler generated for it.
A further possibility for optimisation may be available if LATC and LATD are located at adjacent addresses, in which case you might use a single array of type unsigned long long int in order to write both locations in a single assignment. Of course the 64-bit access is still non-atomic, but the compiler may generate more efficient code in any case. It also neatly avoids the need for the porte_value_latch variable, as there would then be only one reference to PORTE. However, if LATC and LATD must be written in a specific order, you lose that level of control. The loop would look like:
for(;;)
{
    LATCD = arr_1_2[PORTE] ;
}
Where the address of LATCD is the lower address of the adjacent LATC and LATD registers, and LATCD has type unsigned long long int. If LATC has the lower address then:
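/* Alias the 64 bits starting at LATC's address; a plain variable would only hold a copy of LATC's value. */
#define LATCD (*(volatile unsigned long long int *)&LATC)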
so that writing to LATCD writes to both LATC and LATD. You then have to combine arr_1 and arr_2 into a single array of unsigned long long with the appropriate word order, so that each element contains both the C and D values.
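Combining the two tables could be sketched as follows. This assumes a little-endian target on which LATC has the lower address, so the LATC value goes into the low 32 bits of each element; the helper names pack_cd and combine_tables are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs one LATC value and one LATD value into a
   single 64-bit entry. Assumes little-endian storage with LATC at the
   lower address, so the LATC value occupies the low word. */
uint64_t pack_cd(uint32_t c_value, uint32_t d_value)
{
    return ((uint64_t)d_value << 32) | (uint64_t)c_value;
}

/* Builds the combined table from the two original tables. */
void combine_tables(const uint32_t *arr_1, const uint32_t *arr_2,
                    uint64_t *arr_1_2, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        arr_1_2[i] = pack_cd(arr_1[i], arr_2[i]);
    }
}
```

If LATD had the lower address, the two words would simply be swapped. Note that in the listing in the question the two stores use offsets 24736 and 24800 from the same base, which are 64 bytes apart, so on this particular part the registers do not appear to be adjacent.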
Another suggestion: configure the hardware to read PORTE to a single memory location using DMA triggered from a clock signal at >= 4 MHz. The loop would then not need to read PORTE at all, but would instead read the DMA memory location, which may or may not be faster. You could also set up the DMA to write LATC/LATD from a memory location, so that the loop performs no I/O at all. That method would also allow the "adjacent memory" method to work even if LATC and LATD are not actually adjacent.
Ultimately if the issue is only down to the compiler's code generation, then implementing the loop in in-line assembler and hand optimising it may make sense.
Upvotes: 1