
DMA transfer taking more time than CPU transfer

Our task is to demonstrate the benefit of using DMA to copy a large amount of data instead of having the processor do the copying directly. The processor is an STM32F407 on an ST Discovery board.

To measure the copy time, a GPIO pin is turned ON when the copy starts and OFF once it has finished, so the length of that pulse is the copy time.
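The pin toggling in the code relies on the GPIO bit set/reset register (BSRR). Writing a pin's mask to the low 16 bits drives the pin high, and writing the mask shifted left by 16 drives it low, with no read-modify-write of the output register. The host-side C model below is only a sketch of those semantics; `bsrr_apply` is a hypothetical helper, not a HAL function, and on the real part the hardware applies the write atomically.

```c
#include <stdint.h>

/* Host-side model of a write to an STM32 GPIO BSRR register.
 * Bits 0..15 set the matching bits of the output data register (ODR),
 * bits 16..31 reset them; if both are written for the same pin, set wins. */
static uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    /* Apply the reset first, then the set, so set takes priority. */
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}
```

For a pin mask `pin`, `bsrr_apply(odr, pin)` corresponds to `LD5_GPIO_Port->BSRR = LD5_Pin;` and `bsrr_apply(odr, (uint32_t)pin << 16U)` to the `<< 16U` writes that turn the LED off.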

The code appears to work, but it currently shows the CPU copy taking about 2.15 ms and the DMA copy about 4.5 ms, which is the opposite of what was expected. Perhaps there simply isn't enough data for the higher speed of DMA to outweigh the overhead of setting it up?
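The setup-overhead hypothesis can be written as a simple linear cost model. This is a sketch; the symbols are introduced here and none of them were measured in the question. For a copy of $N$ items, let $T_{\text{setup}} > 0$ be the fixed cost of starting the DMA, and $c_{\text{DMA}}$, $c_{\text{CPU}}$ the per-item costs:

```latex
\begin{aligned}
T_{\text{DMA}}(N) &= T_{\text{setup}} + N\,c_{\text{DMA}}, \qquad T_{\text{CPU}}(N) = N\,c_{\text{CPU}},\\
T_{\text{DMA}}(N) < T_{\text{CPU}}(N) &\iff N\,(c_{\text{CPU}} - c_{\text{DMA}}) > T_{\text{setup}},\\
\text{if } c_{\text{DMA}} < c_{\text{CPU}}: \quad &\iff N > N^{*} = \frac{T_{\text{setup}}}{c_{\text{CPU}} - c_{\text{DMA}}}.
\end{aligned}
```

If instead $c_{\text{DMA}} \ge c_{\text{CPU}}$, the left-hand side of the middle line is never positive, so no buffer size makes the DMA copy faster and a larger buffer only widens the gap. Doubling the data size is therefore a quick test: under this model, an overhead-dominated result would shrink relative to the CPU time, while a per-item cost difference would grow.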

For the CPU copy, I have tried both copying the array element by element and using memcpy; both gave very similar times.

The function code is shown below:

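#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "main.h"  // CubeMX-generated header: HAL, LD5/LD6 pin definitions (assumed project layout)

extern DMA_HandleTypeDef hdma_memtomem_dma2_stream0;  // defined in the CubeMX-generated main.c (assumed)

void DMASpeed(void)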
{
    #define elementNum 32000
    int *ptr = NULL;
    ptr = (int*)malloc(elementNum * sizeof(int));
    int *ptr2 = NULL;
    ptr2 = (int*)malloc(elementNum * sizeof(int));
    for (int i = 0; i < elementNum; i++)
    {
        ptr[i] = 4;
    }
    LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
    LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;
    // Initial value
    // printf("BEFORE: dst = '%s'\n", dst);

    // Transfer
    printf("Initiate DMA Transfer...\n");
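    // Note: DataLength counts data items of the stream's configured width, not bytes,
    // and a single transfer is limited to 65535 items (16-bit NDTR register).
    HAL_DMA_Start(&hdma_memtomem_dma2_stream0, (uint32_t)ptr, (uint32_t)ptr2, (elementNum * sizeof(int)));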
    LD5_GPIO_Port->BSRR = LD5_Pin;
    printf("DMA Transfer initiated.\n");


    // Poll for DMA completion
    printf("Poll for DMA completion.\n");
    HAL_DMA_PollForTransfer(&hdma_memtomem_dma2_stream0,
        HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
    LD5_GPIO_Port->BSRR = (uint32_t)LD5_Pin << 16U;
    printf("DMA complete.\n");

    // Print result
    // printf("AFTER: dst = '%s'\n", dst);
    free(ptr);
    free(ptr2);

    ptr = (int*)malloc(elementNum * sizeof(int));
    ptr2 = (int*)malloc(elementNum * sizeof(int));
    for (int i = 0; i < elementNum; i++)
    {
        ptr[i] = i;
    }

    printf("Initiate CPU Transfer...\n");
    LD6_GPIO_Port->BSRR = LD6_Pin;
    //  for (int i = 0; i<512; i++)
    //  {
    //  ptr2[i] = ptr[i];
    //  }
    memcpy(ptr2, ptr, (elementNum * sizeof(int)));
    printf("CPU Transfer Complete.\n");
    LD6_GPIO_Port->BSRR = (uint32_t)LD6_Pin << 16U;

    free(ptr);
    free(ptr2);
}
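In the `HAL_DMA_Start` call above, the length is given in bytes (`elementNum * sizeof(int)`, i.e. 128000), but `DataLength` counts data items of the stream's configured width, and the stream's 16-bit NDTR register caps one transfer at 65535 items. The CubeMX data-width setting isn't shown in the question, so the sketch below only computes how a buffer would have to be split. `dma_chunks_needed` and `dma_chunk_items` are hypothetical helpers, not HAL API.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_ITEMS 65535u  /* NDTR is 16 bits wide on STM32F4 DMA streams */

/* Number of separate DMA transfers needed to move `bytes` bytes when each
 * data item is `item_size` bytes wide (1, 2 or 4). Returns 0 for an empty
 * copy or when `bytes` is not a whole number of items. */
static size_t dma_chunks_needed(size_t bytes, size_t item_size)
{
    if (item_size == 0 || bytes % item_size != 0)
        return 0;
    size_t items = bytes / item_size;
    return (items + DMA_MAX_ITEMS - 1) / DMA_MAX_ITEMS;  /* ceiling division */
}

/* Item count for chunk `index` (0-based) of a copy of `total_items` items;
 * 0 if the index is past the end. */
static uint32_t dma_chunk_items(size_t total_items, size_t index)
{
    size_t start = index * DMA_MAX_ITEMS;
    if (start >= total_items)
        return 0;
    size_t left = total_items - start;
    return (uint32_t)(left < DMA_MAX_ITEMS ? left : DMA_MAX_ITEMS);
}
```

For the question's 128000 bytes, byte-wide items need two transfers (65535 and 62465 items), whereas word-wide items fit in one transfer of 32000. On the target, each chunk would be started with `HAL_DMA_Start` and completed with `HAL_DMA_PollForTransfer` before the next, with both addresses advanced by the chunk's item count times the item size.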

Thanks in advance for any assistance.

Upvotes: 5

Answers (3)

Aleksandr Belykh

Reputation: 1

For those who search for "How to speed up a DMA memory-to-memory transfer?", here is a piece of advice: force your compiler to place all HAL code related to your DMA transfer in RAM, ideally in RAM that is coupled exclusively to the core. The function code is then copied from flash to that RAM at startup, and the functions execute from RAM, which speeds them up. The same is true for copying "by hand". For a DMA transfer, it is recommended to place the following files/functions in RAM:

stm32[whatever]_hal_dma.c
DMA[N]_Stream[M]_IRQHandler(), where N and M are the numbers of the DMA controller and stream used for the transfer, respectively.
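A minimal sketch of placing one function in RAM with GCC. It assumes a linker script that collects `.RamFunc` sections into the initialized-data region copied from flash at startup (STM32CubeIDE-generated scripts do this); the macro `RAM_FUNC` and the function `copy_words_from_ram` are hypothetical names, and the section name must match your own linker script. On a non-ARM build the attribute is simply left out so the code still compiles.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the linker script places ".RamFunc" input sections in the
 * RAM-resident .data output section, so startup code copies them from flash.
 * noinline keeps the body from being inlined back into flash-resident callers. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC __attribute__((section(".RamFunc"), noinline))
#else
#define RAM_FUNC /* host build: no special placement */
#endif

/* Word-by-word copy; on the target build it executes from RAM. */
RAM_FUNC void copy_words_from_ram(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Whole files such as `stm32[whatever]_hal_dma.c` would instead need a matching input-section rule in the linker script; that is not shown here.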

Upvotes: -1

0___________

Reputation: 67820

You are trying to prove something that is not true. A DMA memory-to-memory transfer will always be slower than a direct CPU copy. DMA was not intended to be faster than the CPU; it is there to perform transfers in the background without CPU activity. The core always has priority over the DMA.

Memory-to-memory DMA transfers will always be slower than CPU ones.

There is another problem as well: many STM32 devices have memory areas that the DMA cannot access (for example, CCM RAM).
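On the STM32F405/407, the CCM data RAM is 64 KB at address 0x10000000 and is reachable only by the CPU through its D-bus, so a DMA source or destination must not fall inside it. The check below is a sketch of a sanity test one might run before starting a memory-to-memory transfer; `dma_range_avoids_ccm` is a hypothetical helper, and the constants apply to the F405/407 only.

```c
#include <stdbool.h>
#include <stdint.h>

/* STM32F405/407 core-coupled memory (CCM) data RAM: CPU-only, no DMA access. */
#define CCMRAM_BASE 0x10000000u
#define CCMRAM_SIZE 0x00010000u  /* 64 KB */

/* Returns false if any byte of [addr, addr + len) lies in CCM RAM.
 * 64-bit arithmetic avoids overflow at the top of the address space. */
static bool dma_range_avoids_ccm(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return true;
    uint64_t start = addr;
    uint64_t end = (uint64_t)addr + len;              /* one past the last byte */
    uint64_t ccm_start = CCMRAM_BASE;
    uint64_t ccm_end = (uint64_t)CCMRAM_BASE + CCMRAM_SIZE;
    return end <= ccm_start || start >= ccm_end;      /* true when no overlap */
}
```

Buffers returned by `malloc` normally live in main SRAM (0x20000000 upward) and pass this check; variables explicitly placed in a CCM section would not.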

Upvotes: 10

Vagish

Reputation: 2547

Remove the printf calls in the code segment below:

LD5_GPIO_Port->BSRR = LD5_Pin;
printf("DMA Transfer initiated.\n");  // <--Remove this


// Poll for DMA completion
printf("Poll for DMA completion.\n"); // <--Remove this

You turn the pin ON and then print a lot of text, which adds to your measured time.

Remove all printf calls, or at least do not print anything between turning the pin on and off.

EDIT:

To be precise, you are printing 50 characters in the DMA case and 23 characters in the CPU case.
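To estimate what that printing costs, assume (the question does not say) that printf is retargeted to a blocking UART at 115200 baud with 8N1 framing, so each character occupies the line for 10 bit times (start bit, 8 data bits, stop bit). The hypothetical helper below turns that into a duration.

```c
/* Approximate time, in microseconds, to send `chars` characters over a UART
 * at `baud` bits per second with 8N1 framing (10 bits per character),
 * assuming printf blocks until every character has been transmitted. */
static double uart_print_time_us(unsigned chars, unsigned baud)
{
    const double bits_per_char = 10.0;
    return (double)chars * bits_per_char * 1e6 / (double)baud;
}
```

Under those assumptions, 50 characters take about 4.3 ms and 23 characters about 2.0 ms, close to the 4.5 ms and 2.15 ms reported in the question, which would mean the pulses mostly measure printing rather than copying. With a different baud rate or output channel (for example SWO or semihosting) the figures change.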

Upvotes: 5
