Reputation: 28252
In userland, performing disk IO is as easy as linking against a C library or, if you're adventurous, performing a system call directly. I'm wondering how the kernel itself performs IO.
In other words, suppose I were hypothetically running an application on bare metal in a privileged mode. How would I access disk hardware connected via, say, a SATA connection? Do I perform a load from a predetermined address? Is there some sort of I/O-related instruction?
Upvotes: 4
Views: 1192
Reputation: 949
Linux has a function call tracer. I suggest you trace an IO request.
Warning: The following was written by me without actually knowing the true details.
Basically, you need to use the PCI API to talk to the disk device and set up Direct Memory Access, because you don't want to read disk blocks (or ethernet frames) one byte at a time. So you tell the hardware that some area of memory (beginning at address X and N bytes long) is the DMA area. You also set up memory caching to know that the data in that area of RAM can change without the CPU writing into it, so even on a uniprocessor it's effectively volatile.
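A minimal sketch of what that setup might look like in a Linux PCI driver's probe path. Everything named mydev_* here is hypothetical, the DMA area size is made up, and error handling is kept to a minimum; the real work is done by kernel APIs that do exist (dma_set_mask_and_coherent, dma_alloc_coherent, pci_set_master, request_irq):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>
    #include <linux/interrupt.h>

    #define MYDEV_DMA_BYTES  (64 * 1024)   /* hypothetical size of the DMA area */

    /* Trivial placeholder; a real handler would acknowledge the device and
     * wake whoever is waiting (see the read sketch further down). */
    static irqreturn_t mydev_irq(int irq, void *data)
    {
        return IRQ_HANDLED;
    }

    static int mydev_setup_dma(struct pci_dev *pdev)
    {
        void *cpu_addr;        /* kernel virtual address of the DMA area */
        dma_addr_t bus_addr;   /* address the device uses on the bus     */
        int ret;

        /* Tell the DMA layer which addresses the device can reach. */
        ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
        if (ret)
            return ret;

        /* "Some area of memory of length N bytes is the DMA area":
         * coherent memory, so the CPU sees device writes without extra
         * cache maintenance (this covers the "volatile" concern above). */
        cpu_addr = dma_alloc_coherent(&pdev->dev, MYDEV_DMA_BYTES,
                                      &bus_addr, GFP_KERNEL);
        if (!cpu_addr)
            return -ENOMEM;

        /* Let the controller master the bus and raise interrupts. */
        pci_set_master(pdev);
        ret = request_irq(pdev->irq, mydev_irq, IRQF_SHARED, "mydev", pdev);
        if (ret)
            dma_free_coherent(&pdev->dev, MYDEV_DMA_BYTES, cpu_addr, bus_addr);

        return ret;
    }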
Suppose the hardware supports only a single DMA transaction at a time. Then you transmit commands like "read the 512-byte sector number X (i.e. bytes X<<9 through ((X+1)<<9)-1 of the disk) and put it into the DMA area; when you're done, fire an interrupt". The disk controller does its thing (it has an ARM CPU and everything), talks across PCI to the north bridge hub and through it to RAM, bypassing the CPU. While this happens, you wait (well, the kernel runs other processes while your process sleeps). Millions of CPU cycles later (10 ms is an eternity for a 2 GHz chip), the transfer into RAM completes (or errors out) and an interrupt fires. The OS is notified that the read was completed and can see the data in RAM. It then either copies the data into user process memory, or the data sits in a shared page that the user process can read directly. The user process is resumed (well, put on the ready-to-run queue and eventually runs when the scheduler feels like it).
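As a sketch of that flow in driver terms: the register offsets and command values below are invented (no real controller uses this layout), but the sleep/wake plumbing (a struct completion that the interrupt handler completes) is the standard kernel pattern:

    #include <linux/io.h>
    #include <linux/interrupt.h>
    #include <linux/completion.h>

    /* Hypothetical register offsets and command codes. */
    #define MYDEV_REG_SECTOR   0x00
    #define MYDEV_REG_COUNT    0x04
    #define MYDEV_REG_CMD      0x08
    #define MYDEV_CMD_READ     0x01

    struct mydev {
        void __iomem *regs;       /* mapped in probe, e.g. via pci_iomap()   */
        struct completion done;   /* init_completion() assumed done in probe */
    };

    /* Interrupt handler: the hardware finished DMA-ing into our buffer. */
    static irqreturn_t mydev_irq(int irq, void *data)
    {
        struct mydev *dev = data;
        complete(&dev->done);
        return IRQ_HANDLED;
    }

    /* "Read sector X into the DMA area and fire an interrupt when done." */
    static void mydev_read_sector(struct mydev *dev, u64 sector)
    {
        reinit_completion(&dev->done);
        iowrite32((u32)sector, dev->regs + MYDEV_REG_SECTOR);
        iowrite32(1, dev->regs + MYDEV_REG_COUNT);
        iowrite32(MYDEV_CMD_READ, dev->regs + MYDEV_REG_CMD);

        /* Millions of cycles pass; the scheduler runs other work meanwhile. */
        wait_for_completion(&dev->done);
        /* The sector's 512 bytes are now sitting in the DMA buffer in RAM. */
    }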
Writes work by copying the data into the DMA area and transmitting the "write the data from the DMA area to sector number X on disk and fire an interrupt when done" command. The disk may fire the interrupt when it has actually finished writing, or as soon as it has read the data from RAM into its own write cache; in the latter case fsync doesn't really work and your database and filesystem can be corrupted by power failures.
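Continuing the hypothetical register sketch above: a durable write is really two commands, the write itself and an explicit cache flush (analogous to ATA's FLUSH CACHE). Real drivers express this through the block layer's flush machinery rather than open-coding it, so treat this purely as an illustration:

    #define MYDEV_CMD_WRITE  0x02   /* hypothetical, as above */
    #define MYDEV_CMD_FLUSH  0x03   /* "empty your on-disk write cache"   */

    static void mydev_write_sector_durable(struct mydev *dev, u64 sector)
    {
        /* The caller has already copied the data into the DMA area. */
        reinit_completion(&dev->done);
        iowrite32((u32)sector, dev->regs + MYDEV_REG_SECTOR);
        iowrite32(1, dev->regs + MYDEV_REG_COUNT);
        iowrite32(MYDEV_CMD_WRITE, dev->regs + MYDEV_REG_CMD);
        wait_for_completion(&dev->done);   /* may only mean "data received" */

        /* This second round trip is what fsync() ultimately depends on. */
        reinit_completion(&dev->done);
        iowrite32(MYDEV_CMD_FLUSH, dev->regs + MYDEV_REG_CMD);
        wait_for_completion(&dev->done);   /* now it should be on the platter */
    }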
The OS block cache works on whole 4 KB RAM pages, so it reads 8 sectors at a time, but the idea is the same. Newer disks have a native API that works with 4 KB sectors, but the idea is the same. USB is different from PCI, but the idea is the same. Various high-performance hardware has cleverer APIs for speeding all this up: multiple transactions in flight at the same time and various controls over their ordering.
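The page/sector relationship is just shift arithmetic; a trivial standalone illustration in plain C (the page index is made up, and it assumes the page maps the raw block device contiguously from sector 0):

    #include <stdio.h>

    #define SECTOR_SHIFT_BITS  9    /* 512-byte sectors */
    #define PAGE_SHIFT_BITS    12   /* 4 KB pages       */

    int main(void)
    {
        unsigned sectors_per_page = 1u << (PAGE_SHIFT_BITS - SECTOR_SHIFT_BITS);
        unsigned long long page_index = 1000;   /* hypothetical page-cache index */

        printf("sectors per page: %u\n", sectors_per_page);            /* 8    */
        printf("first sector of that page: %llu\n",
               page_index << (PAGE_SHIFT_BITS - SECTOR_SHIFT_BITS));   /* 8000 */
        return 0;
    }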
Network interfaces that offload TCP/IP probably have an API around packets instead of ethernet frames, because the NIC understands the TCP/IP header.
Block devices that are really network devices hide the translation somewhere (part in hardware, part in firmware, part in software).
In Linux, for my hardware, I think it goes like this:
When the module sata_piix is loaded, it tells the OS the PCI device IDs of the devices it supports and the callbacks the OS should use, all described in a struct. Generic OS PCI topology code discovers a device with ID 8086:27c0 (an ICH7), finds it in the driver's table, and decides this is the right driver for this hardware. From that table the driver will later learn that it should treat this device as an ICH6 SATA device. Since the driver says it supports the device, the OS registers the device with the driver.
From there, the device's control regions are allocated and prepared, DMA is set up, PCI Bus Mastering is enabled (this allows the controller to initiate PCI data transfers to RAM on its own, when it has data ready, instead of waiting for the CPU to initiate the transfer), and interrupt handlers are set up.
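A hedged sketch of that registration dance for a made-up driver (the real sata_piix/libata code is far more involved; the ID table and callback struct are the real mechanism, everything named mydrv_* is hypothetical):

    #include <linux/module.h>
    #include <linux/pci.h>

    /* "Tells the OS the PCI device IDs it supports": the PCI core matches
     * discovered devices against this table. */
    static const struct pci_device_id mydrv_ids[] = {
        { PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x27c0) },   /* the ICH7 mentioned above */
        { }                                            /* terminator */
    };
    MODULE_DEVICE_TABLE(pci, mydrv_ids);

    /* Called by the PCI core once it decides this driver matches the device. */
    static int mydrv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        int ret = pci_enable_device(pdev);   /* wake the device               */
        if (ret)
            return ret;
        pci_set_master(pdev);                /* allow it to DMA on its own    */
        /* ...map control regions, set the DMA mask, request the IRQ, etc... */
        return 0;
    }

    static void mydrv_remove(struct pci_dev *pdev)
    {
        pci_disable_device(pdev);
    }

    /* "All described in a struct": the IDs plus the callbacks the OS should use. */
    static struct pci_driver mydrv = {
        .name     = "mydrv",
        .id_table = mydrv_ids,
        .probe    = mydrv_probe,
        .remove   = mydrv_remove,
    };
    module_pci_driver(mydrv);

    MODULE_LICENSE("GPL");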
The code is generic and supports many generations of hardware (listed in the source in chronological order), so it's difficult to read. Tracing would make it much easier.
Upvotes: 5
Reputation: 23266
This depends on a lot of things. I don't know much about ARM, but I would imagine there is some sort of outportb-style instruction that lets you send data over the bus. Communicating with an I/O device is vendor specific, but usually the kernel either memory-maps the device's registers (as with GPU framebuffers) or sends requests over the bus in a format the device recognizes (usually device specific). Requests are usually just serialized structures containing a start sector, a length, and the address the bus I/O should transfer to.
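Both styles look roughly like this in Linux kernel C on x86. The port number and physical address are made up, not any real device's; the functions themselves (outb/inb, ioremap, readl/writel) are the standard kernel accessors:

    #include <linux/io.h>   /* outb/inb, ioremap, readl/writel */

    #define FAKE_IO_PORT     0x300          /* hypothetical port-mapped register */
    #define FAKE_MMIO_PHYS   0xfeb00000UL   /* hypothetical memory-mapped region */
    #define FAKE_MMIO_LEN    0x1000

    static void io_styles_demo(void)
    {
        /* Port-mapped I/O: a dedicated instruction (x86 "out"/"in"),
         * which is what outportb wraps. */
        outb(0xab, FAKE_IO_PORT);
        (void)inb(FAKE_IO_PORT);

        /* Memory-mapped I/O: the device's registers appear at a physical
         * address; map them and use ordinary-looking loads and stores. */
        void __iomem *regs = ioremap(FAKE_MMIO_PHYS, FAKE_MMIO_LEN);
        if (!regs)
            return;
        writel(0x1, regs);          /* store to a device register  */
        (void)readl(regs + 0x4);    /* load from a device register */
        iounmap(regs);
    }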
If you want to do such things in the kernel (Linux, for example), there are already functions for creating I/O structures and submitting them to a particular device, so you don't need to do it yourself. If you're writing your own OS, I would recommend looking at the Linux source code, which has examples of how to do this at the assembly level for many architectures.
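For example, inside the Linux kernel a block read can be expressed as a struct bio handed to the block layer. The bio_alloc() signature shown here is the newer multi-argument one and has changed across kernel versions, and error handling is minimal, so treat this as a sketch rather than copy-paste driver code:

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* Read one page from `sector` of an already-opened block device. */
    static int read_one_page(struct block_device *bdev, sector_t sector,
                             struct page *page)
    {
        struct bio *bio;
        int ret;

        bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_KERNEL);
        if (!bio)
            return -ENOMEM;

        bio->bi_iter.bi_sector = sector;          /* start sector           */
        __bio_add_page(bio, page, PAGE_SIZE, 0);  /* where to put the data  */

        ret = submit_bio_wait(bio);               /* queue it and sleep     */
        bio_put(bio);
        return ret;
    }

The block layer then takes care of merging, scheduling, and handing the request to whatever driver (SATA, NVMe, USB storage, network block device) actually owns the hardware.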
Upvotes: 1