fazibear

Reputation: 147

How to avoid high cpu usage while reading/writing character device?

I need to write a linux kernel driver for a PCIe device with SRAM.

For the first attempt, I've written a driver to access SRAM from PCIe with a character device.

Everything works as expected, but there is one problem: the SRAM is slow. Reading or writing 1 MB takes about 2 seconds, which is a hardware limitation. The CPU is 100% busy during the transfer, which is a problem. I don't need speed; reading/writing can be slow, but why does it use so much CPU?

The buffer is initialized with pci_iomap():

  g_mmio_buffer[0] = pci_iomap(pdev, SRAM_BAR_H, g_mmio_length);

The read/write functions look like this:

static ssize_t dev_read(struct file *fp, char __user *buf, size_t len, loff_t *off) {
  unsigned long rval;
  size_t copied;

  /* copy_to_user() returns the number of bytes NOT copied (unsigned). */
  rval = copy_to_user(buf, g_mmio_buffer[SRAM_BAR] + *off, len);

  if (rval == len) return -EFAULT;

  copied = len - rval;
  *off += copied;

  return copied;
}

static ssize_t dev_write(struct file *fp, const char __user *buf, size_t len, loff_t *off) {
  unsigned long rval;
  size_t copied;

  /* copy_from_user() returns the number of bytes NOT copied (unsigned). */
  rval = copy_from_user(g_mmio_buffer[SRAM_BAR] + *off, buf, len);

  if (rval == len) return -EFAULT;

  copied = len - rval;
  *off += copied;

  return copied;
}

The question is: what can I do about the high CPU usage?

Should I rewrite the driver to use a block device instead of a character device?

Is there a way to let the CPU work on other processes while the data is being read/written?

Upvotes: 2

Views: 881

Answers (1)

Ian Abbott

Reputation: 17403

As pointed out by @0andriy, you are not supposed to access iomem directly. There are functions such as memcpy_toio() and memcpy_fromio() that can copy between iomem and normal memory, but they only work on kernel virtual addresses.

NOTE: The use of get_user_pages_fast(), set_page_dirty_lock() and put_page() described below should be changed for Linux kernel version 5.6 onwards. The required changes are described later.

In order to copy from userspace addresses to iomem without using an intermediate data buffer, the userspace memory pages need to be "pinned" into physical memory. That can be done using get_user_pages_fast(). However, the pinned pages may be in "high memory" (highmem) which is outside the permanently mapped memory in the kernel. Such pages need to be temporarily mapped into kernel virtual address space for a short duration using kmap_atomic(). (There are rules governing the use of kmap_atomic(), and there are other functions for longer term mapping of highmem. Check the highmem documentation for details.)

Once a userspace page has been mapped to kernel virtual address space, memcpy_toio() and memcpy_fromio() can be used to copy between that page and iomem.

A page temporarily mapped by kmap_atomic() needs to be unmapped by kunmap_atomic().

User memory pages pinned by get_user_pages_fast() need to be unpinned individually by calling put_page(). However, if the page memory has been written to (e.g. by memcpy_fromio()), it must first be flagged as "dirty" by set_page_dirty_lock() before put_page() is called.

Note: Change for kernel version 5.6 onwards.

  1. The call to get_user_pages_fast() should be changed to pin_user_pages_fast().
  2. Dirty pages pinned by pin_user_pages_fast() should be unpinned by unpin_user_pages_dirty_lock() with the last argument set true.
  3. Clean pages pinned by pin_user_pages_fast() should be unpinned by unpin_user_page(), unpin_user_pages(), or unpin_user_pages_dirty_lock() with the last argument set false.
  4. put_page() must not be used to unpin pages pinned by pin_user_pages_fast().
  5. For code to be compatible with earlier kernel versions, the availability of pin_user_pages_fast(), unpin_user_page(), etc. can be determined by whether the FOLL_PIN macro has been defined by #include <linux/mm.h>.

Putting all that together, the following functions may be used to copy between user memory and iomem:

#include <linux/kernel.h>
#include <linux/uaccess.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/io.h>

/**
 * my_copy_to_user_from_iomem - copy to user memory from MMIO
 * @to:     destination in user memory
 * @from:   source in remapped MMIO
 * @n:      number of bytes to copy
 * Context: process
 *
 * Returns number of uncopied bytes.
 */
long my_copy_to_user_from_iomem(void __user *to, const void __iomem *from,
                unsigned long n)
{
    might_fault();
    if (!access_ok(to, n))
        return n;
    while (n) {
        enum { PAGE_LIST_LEN = 32 };
        struct page *page_list[PAGE_LIST_LEN];
        unsigned long start;
        unsigned int p_off;
        unsigned int part_len;
        int nr_pages;
        int i;

        /* Determine pages to do this iteration. */
        p_off = offset_in_page(to);
        start = (unsigned long)to - p_off;
        nr_pages = min_t(int, PAGE_ALIGN(p_off + n) >> PAGE_SHIFT,
                 PAGE_LIST_LEN);
        /* Lock down (for write) user pages. */
#ifdef FOLL_PIN
        nr_pages = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, page_list);
#else
        nr_pages = get_user_pages_fast(start, nr_pages, FOLL_WRITE, page_list);
#endif
        if (nr_pages <= 0)
            break;

        /* Limit number of bytes to end of locked-down pages. */
        part_len =
            min(n, ((unsigned long)nr_pages << PAGE_SHIFT) - p_off);

        /* Copy from iomem to locked-down user memory pages. */
        for (i = 0; i < nr_pages; i++) {
            struct page *page = page_list[i];
            unsigned char *p_va;
            unsigned int plen;

            plen = min((unsigned int)PAGE_SIZE - p_off, part_len);
            p_va = kmap_atomic(page);
            memcpy_fromio(p_va + p_off, from, plen);
            kunmap_atomic(p_va);
#ifndef FOLL_PIN
            set_page_dirty_lock(page);
            put_page(page);
#endif
            to = (char __user *)to + plen;
            from = (const char __iomem *)from + plen;
            n -= plen;
            part_len -= plen;
            p_off = 0;
        }
#ifdef FOLL_PIN
        unpin_user_pages_dirty_lock(page_list, nr_pages, true);
#endif
    }
    return n;
}
    
/**
 * my_copy_from_user_to_iomem - copy from user memory to MMIO
 * @to:     destination in remapped MMIO
 * @from:   source in user memory
 * @n:      number of bytes to copy
 * Context: process
 *
 * Returns number of uncopied bytes.
 */
long my_copy_from_user_to_iomem(void __iomem *to, const void __user *from,
                unsigned long n)
{
    might_fault();
    if (!access_ok(from, n))
        return n;
    while (n) {
        enum { PAGE_LIST_LEN = 32 };
        struct page *page_list[PAGE_LIST_LEN];
        unsigned long start;
        unsigned int p_off;
        unsigned int part_len;
        int nr_pages;
        int i;

        /* Determine pages to do this iteration. */
        p_off = offset_in_page(from);
        start = (unsigned long)from - p_off;
        nr_pages = min_t(int, PAGE_ALIGN(p_off + n) >> PAGE_SHIFT,
                 PAGE_LIST_LEN);
        /* Lock down (for read) user pages. */
#ifdef FOLL_PIN
        nr_pages = pin_user_pages_fast(start, nr_pages, 0, page_list);
#else
        nr_pages = get_user_pages_fast(start, nr_pages, 0, page_list);
#endif
        if (nr_pages <= 0)
            break;

        /* Limit number of bytes to end of locked-down pages. */
        part_len =
            min(n, ((unsigned long)nr_pages << PAGE_SHIFT) - p_off);

        /* Copy from locked-down user memory pages to iomem. */
        for (i = 0; i < nr_pages; i++) {
            struct page *page = page_list[i];
            unsigned char *p_va;
            unsigned int plen;

            plen = min((unsigned int)PAGE_SIZE - p_off, part_len);
            p_va = kmap_atomic(page);
            memcpy_toio(to, p_va + p_off, plen);
            kunmap_atomic(p_va);
#ifndef FOLL_PIN
            put_page(page);
#endif
            to = (char __iomem *)to + plen;
            from = (const char __user *)from + plen;
            n -= plen;
            part_len -= plen;
            p_off = 0;
        }
#ifdef FOLL_PIN
        unpin_user_pages(page_list, nr_pages);
#endif
    }
    return n;
}

Secondly, you might be able to speed up memory access by mapping the iomem as "write combined" by replacing pci_iomap() with pci_iomap_wc().
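Applied to the initialization shown in the question, that is a one-line change (sketch; same variables assumed):

```c
/* Map the BAR as write-combined instead of uncached (sketch). */
g_mmio_buffer[0] = pci_iomap_wc(pdev, SRAM_BAR_H, g_mmio_length);
```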

Thirdly, the only real way to avoid wait-stating the CPU when accessing slow memory is to not use the CPU and use DMA transfers instead. The details of that very much depend on your PCIe device's bus-mastering DMA capabilities (if it has any at all). User memory pages still need to be pinned (e.g. by get_user_pages_fast() or pin_user_pages_fast() as appropriate) during the DMA transfer, but do not need to be temporarily mapped by kmap_atomic().
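As a rough illustration of the pin-map-transfer-unmap sequence for a single page, assuming a kernel with FOLL_PIN: the start_hw_dma() and wait_hw_dma() calls below are hypothetical placeholders for whatever device-specific register programming your PCIe device's DMA engine requires, and error handling is minimal:

```c
/* Hedged sketch: pin one user page and map it for a device-to-memory
 * DMA transfer. start_hw_dma()/wait_hw_dma() are hypothetical
 * placeholders for device-specific DMA engine programming. */
static int dma_read_one_page(struct device *dev, unsigned long uaddr)
{
    struct page *page;
    dma_addr_t dma;
    int ret;

    /* Pin the user page for writing (device will fill it). */
    ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, &page);
    if (ret != 1)
        return ret < 0 ? ret : -EFAULT;

    /* Map the page for DMA; no kmap_atomic() needed. */
    dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, dma)) {
        unpin_user_page(page);
        return -EIO;
    }

    /* start_hw_dma(dev, dma, PAGE_SIZE);   device-specific */
    /* wait_hw_dma(dev);                    device-specific */

    dma_unmap_page(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
    unpin_user_pages_dirty_lock(&page, 1, true);
    return 0;
}
```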

Upvotes: 1
