r0b0t1

Reputation: 1

mmap, axi and multiple reads from pcie

I am trying to optimize reading data over PCIe via mmap. We have some tools that allow reading/writing one word over the PCIe link at a time, but I would like to read/write as many words as required in one request.

My project uses PCIe Gen3 with AXI bridges (2 PCIe bars).

I can successfully read any word from the bus but I notice a pattern when requesting data:

The pattern continues until the addr is a multiple of 4. It seems that if I request the first address, the AXI sends the first 4 values. Any hints? Could this be in the driver that I am using?

Here's how I use mmap:

    length_offset = tmp_offset_rw & ~(sysconf(_SC_PAGESIZE) - 1);
    mmap_offset = (u_long)(tmp_barx_rw << 12) + length_offset;
    mmap_len = (u_long)(tmp_size * sizeof(int));
    mmap_address = mmap(NULL, mmap_len + (int)(tmp_offset_rw) - length_offset,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_offset);

    close(fd);
    // tmp_reg_buf = new u_int[tmp_size];
    // memcpy(tmp_reg_buf, mmap_address, tmp_size*sizeof(int));

    // for(int i = 0; i < 4; i++)
    //   printf("0x%08X\n", tmp_reg_buf[i]);

    for(int i = 0; i < tmp_size; i++)
        printf("0x%08X\n", *((u_int*)mmap_address + (int)tmp_offset_rw - length_offset + i));

Upvotes: 0

Views: 905

Answers (1)

Jamey Hicks

Reputation: 2370

First off, the driver just sets up the mapping between application virtual addresses and physical addresses, but is not involved in requests between the CPU and the FPGA.

PCIe memory regions are typically mapped in uncached fashion, so the memory requests you see in the FPGA correspond exactly to the width of the values the CPU is reading or writing.

If you disassemble the code you have written, you will see load and store instructions operating on different widths of data. Depending on the CPU architecture, load/store instructions requesting wider data may have address-alignment restrictions, or there may be performance penalties for fetching unaligned data.
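To make that concrete, here is a minimal sketch of width-explicit MMIO accessors (the name regs below is an assumption standing in for the pointer returned by mmap() over the BAR). The volatile qualifier keeps the compiler from widening, merging, or eliding the accesses, so the width you write in C is the width of the load/store instruction and, in turn, of the request the FPGA sees:

    #include <stddef.h>
    #include <stdint.h>

    /* 32-bit MMIO read: exactly one 4-byte load reaches the device.
       Offsets should be 4-byte aligned to avoid unaligned-access issues. */
    static inline uint32_t mmio_read32(const volatile void *regs, size_t off)
    {
        return *(const volatile uint32_t *)((const volatile uint8_t *)regs + off);
    }

    /* 32-bit MMIO write: exactly one 4-byte store reaches the device. */
    static inline void mmio_write32(volatile void *regs, size_t off, uint32_t val)
    {
        *(volatile uint32_t *)((volatile uint8_t *)regs + off) = val;
    }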

Different memcpy() implementations often have special cases so that they can use the fewest possible instructions to transfer a given amount of data.

The reason memcpy() may not be suitable for MMIO is that it may read more memory locations than specified in order to use larger transfer sizes. If the MMIO locations have side effects on read, this could cause problems. If you're exposing something that behaves like plain memory, it is OK to use memcpy() with MMIO.
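If your registers do have read side effects, a hedged alternative to memcpy() is a plain word-at-a-time loop; a minimal sketch (names are illustrative, not from the question):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy len 32-bit words out of an MMIO window, issuing exactly one
       4-byte read per word, in order, with no speculative extra accesses. */
    static void mmio_read_buf32(uint32_t *dst, const volatile uint32_t *src, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    }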

If you want higher performance and there is a DMA engine available on the host side of PCIe, or you can include a DMA engine in the FPGA, then you can arrange transfers up to the limits imposed by the PCIe protocol, the BIOS, and the configuration of the PCIe endpoint on the FPGA. DMA is the way to maximize throughput, with bursts of 128 or 256 bytes commonly available.
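What the user-space side of a DMA path looks like depends entirely on the driver; purely as an illustration, assuming the DMA driver exposes a character device (the node /dev/my_dma_c2h below is hypothetical), a burst read can be a single call:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical DMA character device; use whatever your driver provides. */
        int fd = open("/dev/my_dma_c2h", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint32_t buf[1024];
        /* One pread() asks the driver to run a DMA transfer for the whole
           buffer, so the engine can burst on PCIe instead of the CPU issuing
           word-sized reads. The file offset selects the device/AXI address. */
        ssize_t n = pread(fd, buf, sizeof(buf), 0);
        if (n < 0) { perror("pread"); close(fd); return 1; }

        for (ssize_t i = 0; i < n / (ssize_t)sizeof(uint32_t); i++)
            printf("0x%08X\n", buf[i]);

        close(fd);
        return 0;
    }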

The next problem that needs to be addressed to maximize throughput is latency, which can be quite long. DMA engines need to be able to pipeline requests in order to mask the latency from the FPGA to the memory system and back.

Upvotes: 0
