Hisoka Hunter

Reputation: 25

PIC32 speed: optimizing C code

I would like some suggestions for optimizing my code. It is simple, but it needs to be fast, and by fast I mean less than 250 ns.
My first version was slow, about 1000 ns. After some work it is down to about 550 ns. I believe it can be made faster, but I don't know how.
I am using a PIC32 with an 80 MHz system clock.
My code:

void main()
{
    unsigned long int arr_1[4095]; 
    unsigned long int arr_2[4095]; 

    //here I assign arr_1 and arr_2 values
    //...
    //...

    TRISC = 0;
    TRISD = 0;

    while(1){
         LATC = arr_1[PORTE];
         LATD = arr_2[PORTE];
    }

}

As you can see, the job is very simple; the only problem is the speed.
I looked at the assembly listing just to see how many instructions there are, but I don't know assembly language well enough to optimize it.

;main.c, 14 ::      LATC = arr_1[PORTE];
0x9D000064  0x27A30000  ADDIU   R3, SP, 0
0x9D000068  0x3C1EBF88  LUI R30, 49032
0x9D00006C  0x8FC26110  LW  R2, 24848(R30)
0x9D000070  0x00021080  SLL R2, R2, 2
0x9D000074  0x00621021  ADDU    R2, R3, R2
0x9D000078  0x8C420000  LW  R2, 0(R2)
0x9D00007C  0x3C1EBF88  LUI R30, 49032
0x9D000080  0xAFC260A0  SW  R2, 24736(R30)
;main.c, 15 ::      LATD = arr_2[PORTE];
0x9D000084  0x27A33FFC  ADDIU   R3, SP, 16380
0x9D000088  0x3C1EBF88  LUI R30, 49032
0x9D00008C  0x8FC26110  LW  R2, 24848(R30)
0x9D000090  0x00021080  SLL R2, R2, 2
0x9D000094  0x00621021  ADDU    R2, R3, R2
0x9D000098  0x8C420000  LW  R2, 0(R2)
0x9D00009C  0x3C1EBF88  LUI R30, 49032
;main.c, 16 ::      }
0x9D0000A0  0x0B400019  J   L_main0
0x9D0000A4  0xAFC260E0  SW  R2, 24800(R30)  

Any suggestions for optimizing my code?

Edit:
* PORTE, LATC and LATD are I/O mapped registers.
* The goal of the code is to change the LATC and LATD registers as fast as possible when PORTE changes (so PORTE is an input and LATC and LATD are outputs); the output depends on the value of PORTE.

Upvotes: 0

Views: 339

Answers (1)

Clifford

Reputation: 93564

A potential limiting factor is that since PORTE, LATC and LATD are not regular memory but rather I/O registers, it is possible that the I/O bus speed is lower than the memory bus speed and that the processor inserts wait-states between accesses. That may or may not be the case for PIC32, but it is a general point that you need to consider for any architecture.
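To put rough numbers on this: at 80 MHz one cycle is 12.5 ns, so the 250 ns target is 20 cycles and the measured 550 ns is 44 cycles. The listing in the question shows 17 instructions per loop iteration. Assuming the 550 ns figure is for one iteration, about 27 cycles are going somewhere other than one-cycle-per-instruction execution. That would be consistent with wait-states on flash or I/O accesses, although the listing alone cannot say which. A minimal sketch of the arithmetic follows; the helper name cycles_in is made up for illustration.

```c
#include <assert.h>

/* Number of whole CPU cycles that fit in `ns` nanoseconds
   when the core runs at `clock_mhz` MHz.
   Hypothetical helper, not part of any PIC32 library. */
unsigned long cycles_in(unsigned long ns, unsigned long clock_mhz)
{
    assert(clock_mhz > 0UL);
    /* cycles = ns * (MHz * 1e6) / 1e9 = ns * MHz / 1000 */
    return (ns * clock_mhz) / 1000UL;
}
```

With the figures from the question, cycles_in(250, 80) gives the cycle budget and cycles_in(550, 80) gives the cycles currently used per iteration.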

If the I/O bus is not a limitation then, first of all, have you applied compiler optimisations? For such micro-optimisations that is usually your best bet. This code looks trivial to optimise, but the assembler listing does not appear to reflect that (although I am no MIPS assembler expert; the compiler's optimiser, however, is).

Since I/O registers are volatile, the optimiser may be prevented from optimising the loop body significantly. But because they are volatile, the code is probably also unsafe: it is possible (and indeed likely) for PORTE to change value between the assignments to LATC and LATD, which may not be what you intend. If that is the case then the code should be changed as follows:

int porte_value_latch = 0 ;
for(;;)
{
     // Get a non-volatile copy of PORTE.
     porte_value_latch = PORTE ;  

     // Write LATC/D with a consistent PORTE value that 
     // won't change between assignments, and does not need 
     // to be read from memory or I/O.
     LATC = arr_1[porte_value_latch] ;
     LATD = arr_2[porte_value_latch] ;
}

which is then both safe and potentially faster since the volatile PORTE is only read once, and the porte_value_latch value can be retained in a temporary register for both array accesses rather than read from memory each time. The optimiser will almost certainly optimise it to a register access even if regular compilation does not.
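The hazard can be demonstrated off-target with a small host-side simulation. This is only a sketch: sim_read_porte, table_c, table_d, sim_init and the two update functions are hypothetical stand-ins for PORTE, arr_1, arr_2 and the loop body, and the simulated port deliberately changes on every read to model the worst case.

```c
#include <stdint.h>

#define TABLE_SIZE 8u

/* Hypothetical lookup tables standing in for arr_1 and arr_2. */
static uint32_t table_c[TABLE_SIZE];
static uint32_t table_d[TABLE_SIZE];

/* Simulated input port: yields a new value on every read, the way a
   real input can change between two consecutive reads. */
static uint32_t sim_porte_state;

uint32_t sim_read_porte(void)
{
    return sim_porte_state++ % TABLE_SIZE;
}

struct outputs { uint32_t latc; uint32_t latd; };

/* Fill both tables with the index itself so a mismatch is easy to see,
   and reset the simulated port. */
void sim_init(void)
{
    for (uint32_t i = 0; i < TABLE_SIZE; ++i) {
        table_c[i] = i;
        table_d[i] = i;
    }
    sim_porte_state = 0;
}

/* Original pattern: the port is read once per assignment. */
struct outputs update_two_reads(void)
{
    struct outputs o;
    o.latc = table_c[sim_read_porte()];
    o.latd = table_d[sim_read_porte()];
    return o;
}

/* Latched pattern: one read, both lookups use the same index. */
struct outputs update_one_read(void)
{
    uint32_t idx = sim_read_porte();
    struct outputs o = { table_c[idx], table_d[idx] };
    return o;
}
```

In this simulation update_two_reads produces outputs looked up from two different port values, while update_one_read always produces a matching pair.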

The use of for(;;) rather than while(1) probably makes little difference, but some compilers issue a warning for invariant while expressions, yet accept the for(;;) idiom quietly. You have not included the generated assembler for line 13, so it is not possible to determine what your compiler produced for it.

A further optimisation may be available if LATC and LATD are located at adjacent addresses, in which case you might use a single array of type unsigned long long int in order to write both locations in a single assignment. Of course the 64-bit access is still non-atomic, but the compiler may generate more efficient code in any case. It also neatly avoids the need for the porte_value_latch variable, as there would then be only one reference to PORTE. However, if LATC and LATD must be written in a specific order, you lose that level of control. The loop would look like:

for(;;)
{
    LATCD = arr_1_2[PORTE] ;
}

Where the address of LATCD is the low-order address of the adjacent LATC and LATD registers, and has type unsigned long long int . If LATC has the lower address then:

#define LATCD (*(volatile unsigned long long int *)&LATC)

so that writing to LATCD writes to both LATC and LATD. You then have to combine arr_1 and arr_2 into a single array of unsigned long long int, with the appropriate word order, so that each element contains both the C and D values.
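A minimal sketch of the word-order handling follows, assuming two genuinely adjacent 32-bit registers with LATC at the lower address. The names pack_cd, reg_pair and write_pair are hypothetical, and reg_pair is only a host-side stand-in for the register pair. Note that in the listing from the question LATC and LATD are addressed at offsets 24736 and 24800 from the same base, i.e. 64 bytes apart, so on that particular part the direct 64-bit write would not apply as-is.

```c
#include <stdint.h>
#include <string.h>

/* Host-side stand-in for two adjacent 32-bit registers,
   with latc at the lower address. */
struct reg_pair {
    uint32_t latc;
    uint32_t latd;
};

/* Build one 64-bit table entry so that, when it is stored to memory,
   c_value lands at the lower address and d_value at the higher one.
   Going through memcpy makes this independent of byte order. */
uint64_t pack_cd(uint32_t c_value, uint32_t d_value)
{
    uint32_t words[2] = { c_value, d_value };
    uint64_t combined;
    memcpy(&combined, words, sizeof combined);
    return combined;
}

/* Store a combined entry over both registers in one 64-bit copy. */
void write_pair(struct reg_pair *regs, uint64_t combined)
{
    memcpy(regs, &combined, sizeof combined);
}
```

The combined table would be filled once at start-up, for example arr_1_2[i] = pack_cd(arr_1[i], arr_2[i]), so that the loop itself does no packing work.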

Another suggestion: Configure the hardware to read PORTE to a single location using DMA triggered from a clock signal at >=4MHz. The loop would then not need to read PORTE at all but rather read the DMA memory location which may or may not be faster. You could also set up the DMA to write LATC/LATD from a memory location so that the loop performs no I/O at all. That method would also allow the "adjacent memory" method to work even if LATC and LATD are not actually adjacent.

Ultimately if the issue is only down to the compiler's code generation, then implementing the loop in in-line assembler and hand optimising it may make sense.

Upvotes: 1
