WenJuan

Reputation: 684

mmap very slow when using O_SYNC

Brief description of our project: We are using a Cyclone V in our project. The FPGA writes data to DDR over the AXI bus, and our application needs to send that data out over Ethernet. We benchmarked our Ethernet throughput with iperf, and it reaches about 700 Mbps. When we test our application's throughput, we get only 400 Mbps. We wrote a simple server that does not use /dev/mem, then populated the memory with random data using the dd command and had the application read the file and send it out. In that test the throughput is close to the iperf benchmark. We found that when we remove O_SYNC from the open() of /dev/mem, the throughput gets close to that of iperf. The problem is that without O_SYNC we intermittently get wrong data.

We allocate the contiguous memory using dma_alloc_coherent:

p_ximageConfig->fpgamem_virt = dma_alloc_coherent(NULL, Dma_Size, &(p_ximageConfig->fpgamem_phys), GFP_KERNEL);

and we pass the physical address to user space via an ioctl so that it can be mmap'ed:

uint32 DMAPHYSADDR = getDmaPhysAddr();
pImagePool = ((volatile unsigned char*)mmap( 0,MAPPED_SIZE_BUFFER, PROT_READ|PROT_WRITE, MAP_SHARED, _fdFpga, DMAPHYSADDR));

We have tried the following methods:

  1. Writing our own mmap in our driver: We still get wrong data intermittently if we do not sync. The sync methods we tried are pgprot_noncached and pgprot_dmacoherent, but they only reach 300 Mbps.

  2. Using dma_mmap_coherent: The result is about 500 Mbps.

Is there any method that would let us get throughput close to what iperf achieves?

Upvotes: 0

Views: 2772

Answers (1)

y s

Reputation: 26

I don't know why iperf is so fast, but I can explain how mmap'ing device memory works.

Let's look at the mmap_mem() function, which handles the user's mmap call on /dev/mem. According to this line, the function maps the memory as noncached if O_SYNC is specified, and (probably) as writeback otherwise. So setting vma->vm_page_prot = __pgprot_modify(vma->vm_page_prot, L_PTE_MT_MASK, L_PTE_MT_WRITEBACK); may make it faster.
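The user-space side of that choice can be sketched roughly as follows. This is only an illustration: map_region, struct mapping, and all parameter names are hypothetical and not from the original code. It opens a node with or without O_SYNC and maps a region whose offset is rounded down to a page boundary, because mmap requires a page-aligned offset. Note that the O_SYNC flag only selects the cache attribute for /dev/mem; on a regular file it has no such effect.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type: base/maplen are what munmap() needs,
 * data points at the byte that was actually requested. */
struct mapping {
    void *base;
    size_t maplen;
    volatile unsigned char *data;
};

/* Hypothetical helper: map len bytes starting at byte offset phys of path.
 * use_sync != 0 opens the node with O_SYNC (noncached for /dev/mem);
 * use_sync == 0 opens it without (possibly cached for /dev/mem).
 * On failure, data is NULL. */
struct mapping map_region(const char *path, off_t phys, size_t len, int use_sync)
{
    struct mapping m = { NULL, 0, NULL };
    assert(path != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return m;

    /* mmap needs a page-aligned offset, so round down and remember the gap. */
    off_t aligned = phys & ~((off_t)page - 1);
    size_t delta = (size_t)(phys - aligned);

    int fd = open(path, O_RDWR | (use_sync ? O_SYNC : 0));
    if (fd < 0)
        return m;

    void *base = mmap(NULL, len + delta, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, aligned);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return m;

    m.base = base;
    m.maplen = len + delta;
    m.data = (volatile unsigned char *)base + delta;
    return m;
}
```

A caller would check that data is not NULL, use the buffer, and then release it with munmap(m.base, m.maplen).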

So now we have enabled caching for that memory area. How, then, do we keep its contents synchronized with the FPGA?

One way is to synchronize in software. There are the dmac_map_area() and dmac_unmap_area() calls, which correspond to v7_dma_map_area() and v7_dma_unmap_area() respectively. These functions take three parameters: the address addr, the size size, and the DMA direction dir.

When we call dmac_map_area(addr, size, DMA_TO_DEVICE), the contents of the CPU cache are written back to memory. So do this when the CPU has finished writing to memory and the device is about to read from that location.

When we call dmac_unmap_area(addr, size, DMA_FROM_DEVICE), the contents of the CPU cache are marked as "invalid", so the next read from that location fetches the new data written by the device into the CPU cache. So do this when the device has finished writing to memory and the CPU is about to read from that location.
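To see why the ordering matters, here is a toy user-space model of a single cache line. It is not the kernel API: struct toy_cache and all function names are made up for illustration, with clean() standing in for the DMA_TO_DEVICE step and invalidate() for the DMA_FROM_DEVICE step. It shows how a cached mapping without these steps can return stale data, which would be consistent with the intermittent wrong data in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_LINE_BYTES 32 /* size of the single modeled cache line (illustrative) */

struct toy_cache {
    unsigned char dram[TOY_LINE_BYTES]; /* what the device (FPGA) reads and writes */
    unsigned char line[TOY_LINE_BYTES]; /* the CPU's cached copy */
    int valid;                          /* line currently holds a copy of dram */
    int dirty;                          /* line has CPU writes not yet in dram */
};

/* Load the line from dram on a miss. */
static void toy_fill(struct toy_cache *c)
{
    if (!c->valid) {
        memcpy(c->line, c->dram, TOY_LINE_BYTES);
        c->valid = 1;
        c->dirty = 0;
    }
}

/* CPU accesses go through the cache. */
unsigned char cpu_read(struct toy_cache *c, size_t i)
{
    assert(i < TOY_LINE_BYTES);
    toy_fill(c);
    return c->line[i];
}

void cpu_write(struct toy_cache *c, size_t i, unsigned char v)
{
    assert(i < TOY_LINE_BYTES);
    toy_fill(c);
    c->line[i] = v;
    c->dirty = 1;
}

/* Device accesses bypass the cache and touch dram directly. */
void device_write(struct toy_cache *c, size_t i, unsigned char v)
{
    assert(i < TOY_LINE_BYTES);
    c->dram[i] = v;
}

unsigned char device_read(const struct toy_cache *c, size_t i)
{
    assert(i < TOY_LINE_BYTES);
    return c->dram[i];
}

/* Models the DMA_TO_DEVICE step: write dirty cached data back to dram. */
void clean(struct toy_cache *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->dram, c->line, TOY_LINE_BYTES);
        c->dirty = 0;
    }
}

/* Models the DMA_FROM_DEVICE step: drop the cached copy so the next read refetches. */
void invalidate(struct toy_cache *c)
{
    c->valid = 0;
    c->dirty = 0;
}
```

In this model, a cpu_read() that follows a device_write() keeps returning the old value until invalidate() is called, and a device_read() does not see a cpu_write() until clean() is called.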

The other way is to use dedicated hardware. According to this pdf, the Cyclone V has an Accelerator Coherency Port (ACP), which lets the FPGA read the ARM's cache contents. I think this may be faster than doing it in software, but since I don't know how to use the ACP, please try searching for it.

Upvotes: 1