user656925

How can I optimize this image copy function for an embedded system

The function below reads an image a page at a time using read_page(pageIter, pageArr, PAGESIZE) and outputs the data on the DOUT and CCLK pins.

I was told it was inefficient, but I can't seem to find a way to make it faster. It is basically a pipe, running on a 64-pin microprocessor, between two memory spaces. One holds the image and the other receives the image.

I've used the register keyword and replaced array indexing with pointer arithmetic, but it needs to be faster.

Thanks!

/*
Port C Pin Out
*/
#define     BIT0        0x01    // CCLK
#define     BIT1        0x02    // CS_B
#define     BIT2        0x04    // INIT_B
#define     BIT3        0x08    // PROG_B
#define     BIT4        0x10    // RDRW_B
#define     BIT5        0x20    // BUSY_OUT
#define     BIT6        0x40    // DONE
#define     BIT7        0x80    // DOUT (DIN)

/*
PAGE
*/

#define     PAGESIZE    1024    // Example

void copyImage(ulong startAddress, ulong endAddress)
  {
  ulong pageIter;
  uchar *eByte, *byteIter, pageArr[PAGESIZE];
  register uchar bitIter, portCvar;
  portCvar = PORTC;
  /* Loops through pages in an image using ulong type*/
  for(pageIter = startAddress ;  pageIter <= endAddress ; pageIter += PAGESIZE)
    {
    read_page(pageIter, pageArr, PAGESIZE);
    eByte = pageArr+PAGESIZE;
    /* Loops through bytes in a page using pointer to uchar (pointer to a byte)*/
    for(byteIter = pageArr; byteIter < eByte; byteIter++)
      {
      /* Loops through bits in byte and writes to PORTC - DIN and CCLK */
      for(bitIter = 0x01; bitIter != 0x00; bitIter = bitIter << 1)
        {
        PORTC = portCvar | BIT0;
        (bitIter & *byteIter) ? (PORTC = portCvar & ~BIT7) : (PORTC = portCvar | BIT7);
        PORTC = portCvar & ~BIT0;
        }
      }
    }
  }

Upvotes: 3

Views: 304

Answers (4)

AShelly

Reputation: 35580

I'm assuming that PORTC is in a known state when you enter this function: i.e. the Data and Clock lines are 0? (or Clock is low and Data is high?)

If that assumption is true, you should even be able to avoid the conditionals in @6502's answer by first setting value = ~(*byteIter); and then doing this 8 times:

 PORTC|=BIT0;PORTC|=(value<<7)&BIT7;PORTC&=~(BIT7|BIT0);value>>=1;

-or, if Bit7 starts high -

 PORTC|=(BIT7|BIT0);PORTC&=(~BIT7|(value<<7));PORTC&=~BIT0;value>>=1;

The advantage here is that it avoids the conditionals, which can play havoc with the speed of a heavily pipelined processor.
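
To make the first variant concrete, here is a minimal host-side sketch, assuming PORTC starts with the clock and data lines at 0. PORTC is a plain variable standing in for the real memory-mapped register, and dout_seen / clocks_seen are hypothetical trace variables that exist only so the behaviour can be checked on a PC. The loop is there for brevity; on the target you would repeat the three port statements 8 times by hand.

```c
#include <stdint.h>

#define BIT0 0x01u  /* CCLK */
#define BIT7 0x80u  /* DOUT (DIN) */

/* Assumption: stand-in for the memory-mapped port register so the sketch
   runs on a host. On the target, PORTC comes from the vendor header. */
static volatile uint8_t PORTC = 0;

/* Trace only: bit i of dout_seen is the DOUT level during clock i. */
static uint8_t dout_seen = 0;
static uint8_t clocks_seen = 0;

static void send_byte_branchless(uint8_t byte)
{
    uint8_t value = (uint8_t)~byte;  /* inverted once, up front */
    uint8_t i;

    for (i = 0; i < 8; i++) {
        PORTC |= BIT0;                            /* clock high */
        PORTC |= (uint8_t)((value << 7) & BIT7);  /* bit 0 of value -> DOUT */

        /* Trace only, not part of the transmit sequence. */
        if (PORTC & BIT0) clocks_seen++;
        if (PORTC & BIT7) dout_seen |= (uint8_t)(1u << i);

        PORTC &= (uint8_t)~(BIT7 | BIT0);         /* clock and data low */
        value >>= 1;                              /* next bit, LSB first */
    }
}
```

The data bit reaches the port through a shift and a mask, so the only branch left in the transmit path is the loop counter, and that disappears once the body is unrolled.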

Upvotes: 1

6502

Reputation: 114559

Probably you can go faster by unrolling the transmission of each byte with something like

PORTC = clock_1; PORTC = (value & 0x01 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x02 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x04 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x08 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x10 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x20 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x40 ? data1 : data0); PORTC = clock_0;
PORTC = clock_1; PORTC = (value & 0x80 ? data1 : data0); PORTC = clock_0;

after precomputing once outside the image loop

unsigned char clock_1 = portC | BIT0;
unsigned char clock_0 = portC & ~BIT0;
unsigned char data1 = portC | BIT7;
unsigned char data0 = portC & ~BIT7;
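
Put together, the idea might look like the sketch below. It is written to run on a PC: port_write() is a hypothetical stand-in that logs each value instead of storing to PORTC, and the port_levels struct, make_levels() and SEND_BIT are names made up for this illustration. On the target, port_write(v) would just be PORTC = v.

```c
#include <stdint.h>
#include <stddef.h>

#define BIT0 0x01u  /* CCLK */
#define BIT7 0x80u  /* DOUT (DIN) */

/* Assumption: host-side log replacing the real port register. */
#define LOG_MAX 64
static uint8_t port_log[LOG_MAX];
static size_t  port_log_len = 0;

static void port_write(uint8_t v)
{
    if (port_log_len < LOG_MAX)
        port_log[port_log_len++] = v;
}

/* The four port values, computed once outside the image loop. */
typedef struct {
    uint8_t clock_1, clock_0, data1, data0;
} port_levels;

static port_levels make_levels(uint8_t portC)
{
    port_levels lv;
    lv.clock_1 = (uint8_t)(portC | BIT0);
    lv.clock_0 = (uint8_t)(portC & ~BIT0);
    lv.data1   = (uint8_t)(portC | BIT7);
    lv.data0   = (uint8_t)(portC & ~BIT7);
    return lv;
}

/* One bit: clock high, data level, clock low (three plain stores). */
#define SEND_BIT(lv, value, mask)                                        \
    do {                                                                 \
        port_write((lv)->clock_1);                                       \
        port_write(((value) & (mask)) ? (lv)->data1 : (lv)->data0);      \
        port_write((lv)->clock_0);                                       \
    } while (0)

/* Fully unrolled byte transmit, LSB first, with constant masks. */
static void send_byte_unrolled(const port_levels *lv, uint8_t value)
{
    SEND_BIT(lv, value, 0x01);
    SEND_BIT(lv, value, 0x02);
    SEND_BIT(lv, value, 0x04);
    SEND_BIT(lv, value, 0x08);
    SEND_BIT(lv, value, 0x10);
    SEND_BIT(lv, value, 0x20);
    SEND_BIT(lv, value, 0x40);
    SEND_BIT(lv, value, 0x80);
}
```

Each bit costs three stores of precomputed values, with no shifting mask and no loop counter to maintain.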

Upvotes: 5

Lindydancer

Reputation: 26124

/* Loops through bits in byte and writes to PORTC - DIN and CCLK */
for(bitIter = 0x01; bitIter <= 0x80; bitIter = bitIter << 1)
{
    PORTC = portC | BIT0;
    (bitIter & byteIter) ? (PORTC = portC & ~BIT7) : (PORTC = portC | BIT7);
    PORTC = portC & ~BIT0;
}

To start with, this loop is broken. bitIter is a uchar (which I assume is an unsigned 8-bit character). Shifting it to the left will eventually give it the value 0x80 for the intended final iteration. The next shift gives it the value 0, which still satisfies bitIter <= 0x80, so the loop never terminates.
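
A small host-side check makes the wrap-around visible, assuming uchar is an 8-bit unsigned type (uint8_t here). The function names and the limit parameter are made up for this demonstration; the limit only exists so the demonstration itself stops.

```c
#include <stdint.h>

/* Counts passes of `for (b = 0x01; b <= 0x80; b <<= 1)`, giving up after
   `limit` passes because the loop would otherwise run forever. */
static unsigned count_with_le_0x80(unsigned limit)
{
    unsigned n = 0;
    uint8_t bitIter;

    for (bitIter = 0x01; bitIter <= 0x80; bitIter = (uint8_t)(bitIter << 1)) {
        if (++n >= limit)
            break;  /* bitIter has wrapped to 0 and stays there */
    }
    return n;
}

/* Counts passes of `for (b = 0x01; b != 0x00; b <<= 1)`. */
static unsigned count_with_ne_zero(void)
{
    unsigned n = 0;
    uint8_t bitIter;

    for (bitIter = 0x01; bitIter != 0x00; bitIter = (uint8_t)(bitIter << 1))
        n++;
    return n;
}
```

The first function always runs into whatever limit it is given, while the second makes exactly 8 passes, one per bit.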

On to efficiency. Depending on the architecture, the operation PORTC = PORTC | BIT0 might compile to a single bit-set instruction. However, it might also compile to a read, a bit set in a register, and a store.

As mentioned before, if possible, try to set the BIT0 and BIT7 at the same time (if the hardware permits this).

I would try something like:

bitIter = 0x01;
do
{
  if (byteIter & bitIter)
  {
    PORTC = BIT0;
  }
  else
  {
    PORTC = (BIT0 | BIT7);
  }
  PORTC = 0;

  bitIter <<= 1;
} while (bitIter != 0);  /* bitIter wraps to 0 after the 0x80 pass, so all 8 bits are sent */

Using a do ... while loop fixes the termination problem, and it also gets rid of the unnecessary loop test before the first iteration (unless your compiler has already optimized it away).

You could also try to unroll the loop by hand, eight times, once for every bit.

Upvotes: 2

Ben Jackson

Reputation: 93860

/* Loops through bits in byte and writes to PORTC - DIN and CCLK */
for(bitIter = 0x01; bitIter <= 0x80; bitIter = bitIter << 1)
{
    PORTC = portC | BIT0;
    (bitIter & byteIter) ? (PORTC = portC & ~BIT7) : (PORTC = portC | BIT7);
    PORTC = portC & ~BIT0;
}

That loop is the key. I would compile it with production optimization flags and then look at the disassembly. The compiler may do all kinds of clever things like unroll the loop or simplify the loop condition. If I didn't like what I saw there I'd start tweaking the C code to help the compiler find a good optimization. If that proved impossible then I might use inline assembly to get what I want.

Assuming we can go as fast as possible (and delays in the loop aren't accounting for setup-hold times at the receiver), I'd want to get that loop down to as few instructions as possible. Can you set BIT0 and the data bit at the same time, or does that create a hazard at the receiver? If you can, that would save an instruction or two. Lots of micro-optimizations would rely on the specific instruction set. If the data has lots of 0x00 or 0xFF bytes, you could make special unrolled cases where the data bit doesn't change and BIT0 toggles 8 times. You could also make 16 unrolled cases for a single nybble and switch into that twice for each byte.
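
As a rough sketch of the 0x00 / 0xFF special case: the code below assumes the receiver samples DOUT on the rising edge of CCLK and is happy for DOUT to be held across clocks, which is not the exact write order from the question, so check it against the receiver's timing first. port_write() is a hypothetical stand-in that logs values on a PC instead of storing to PORTC, and send_byte_fastpath() and base are names made up for this illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define BIT0 0x01u  /* CCLK */
#define BIT7 0x80u  /* DOUT (DIN) */

/* Assumption: host-side log replacing the real port register. */
#define LOG_MAX 64
static uint8_t port_log[LOG_MAX];
static size_t  port_log_len = 0;

static void port_write(uint8_t v)
{
    if (port_log_len < LOG_MAX)
        port_log[port_log_len++] = v;
}

/* base: port value with CCLK and DOUT both low, other pins preserved.
   Assumption: DOUT high means a 1 bit; swap the levels if yours is inverted. */
static void send_byte_fastpath(uint8_t base, uint8_t value)
{
    uint8_t mask, d;
    int i;

    if (value == 0x00 || value == 0xFF) {
        /* DOUT never changes: set it once, then pulse CCLK 8 times. */
        d = (uint8_t)(value ? (base | BIT7) : base);
        port_write(d);
        for (i = 0; i < 8; i++) {
            port_write((uint8_t)(d | BIT0));  /* clock high, data held */
            port_write(d);                    /* clock low, data held */
        }
        return;
    }

    /* General case: one data write and one clock pulse per bit, LSB first. */
    for (mask = 0x01; mask != 0; mask = (uint8_t)(mask << 1)) {
        d = (uint8_t)((value & mask) ? (base | BIT7) : base);
        port_write(d);                        /* data set, clock low */
        port_write((uint8_t)(d | BIT0));      /* clock high */
        port_write(d);                        /* clock low */
    }
}
```

In this sketch a constant byte takes 17 port writes and no per-bit test, against 24 writes and 8 tests for the general path. Whether that pays off depends on how many such bytes the image contains.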

Upvotes: 2
