Does Linux read() copy data into the process address space

Question

I am trying to understand a specific difference between read() and mmap(). I have a basic/decent understanding of both but there's something fundamental that I'm not getting.

I would imagine the answer is pretty simple here but here's the question:

Let's say you open a file "test.txt" which is not present in the file cache and you want to read the first 64 bytes. My understanding is that the first 4k of bytes are read into the page cache and then 64 bytes are copied into the buffer for the read() call.

My questions:

1) When you read in the data via read() and the 4k is stored in the file system cache, does that take up your process's virtual memory address space or is that just disk cache space that could/will be paged out later? I know that mmap will map the file (or portion of file) into the process address space but I couldn't figure out if read() uses process address space. My guess is it does NOT because read() doesn't allow you to randomly access portions of the file (is that correct?).

2) That 64 bytes that is copied into the buffer returned by the read() to be used by the process, does this data take up process address space or just disk space cache?

Andrew Henle · Accepted Answer

My understanding is that the first 4k of bytes are read into the page cache and then 64 bytes are copied into the buffer for the read() call.

In general, that is correct. (But there are always exceptions - in this case direct I/O. You really don't ever need to worry much about that unless you're dealing with some I/O corner cases...)
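For the curious, the direct I/O exception looks roughly like the sketch below. It is a minimal illustration, not a recommendation: `prefix_via_direct_io` and `DIO_ALIGN` are hypothetical names, and the 4096-byte alignment is an assumption that happens to satisfy most block devices (the real requirement depends on the device and filesystem). With `O_DIRECT` the kernel tries to transfer data straight between the device and the caller's buffer, bypassing the page cache, which is why the buffer, offset, and length all have to be aligned.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in glibc's <fcntl.h> */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096         /* assumed alignment; device-dependent in reality */

/* Hypothetical helper: read the first `len` bytes (len <= DIO_ALIGN) of
 * `path` using direct I/O. Returns the number of bytes stored in `out`,
 * or -1 on error (for example, when the filesystem rejects O_DIRECT). */
ssize_t prefix_via_direct_io(const char *path, void *out, size_t len)
{
    assert(len <= DIO_ALIGN);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    /* O_DIRECT needs an aligned buffer and an aligned transfer size,
     * so read one whole aligned block and hand back only a prefix. */
    void *block = NULL;
    if (posix_memalign(&block, DIO_ALIGN, DIO_ALIGN) != 0) {
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, block, DIO_ALIGN);
    close(fd);

    if (n > 0) {
        if ((size_t)n > len)
            n = (ssize_t)len;
        memcpy(out, block, (size_t)n);
    }
    free(block);
    return n;
}
```

Note that some filesystems (tmpfs, for one) refuse `O_DIRECT` outright, so a caller should be prepared to fall back to an ordinary buffered `read()`.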

When you read in the data via read() and the 4k is stored in the file system cache, does that take up your process's virtual memory address space or is that just disk cache space that could/will be paged out later?

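The latter - the disk cache (the page cache) is memory in kernel space that, well, caches the contents of data on disk. It does not take up any of your process's virtual address space. And it can be evicted under memory pressure: clean pages are simply dropped and re-read from disk if they are needed again, while dirty pages are written back to disk first.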

That 64 bytes that is copied into the buffer returned by the read() to be used by the process, does this data take up process address space or just disk space cache?

The data is copied from the disk cache (kernel memory) into the buffer that's in user space, and that buffer is part of your process's address space. So the data is in both places. (Which is a reason for direct I/O - the extra copy step and the extra copy of the data itself are both eliminated.)
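To make the difference concrete, here is a minimal sketch of both paths. The helper names `prefix_via_read` and `prefix_via_mmap` are made up for illustration, and error handling is kept to the bare minimum. In the `read()` version the only process memory involved is the caller's own buffer, which the kernel fills by copying from the page cache. In the `mmap()` version the page-cache pages themselves are mapped into the process's address space.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* read() path: the kernel copies up to `len` bytes from the page cache
 * into `buf`, which is ordinary memory in the caller's address space. */
ssize_t prefix_via_read(const char *path, void *buf, size_t len)
{
    assert(buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);   /* may return fewer than len bytes */
    close(fd);
    return n;
}

/* mmap() path: the file's page-cache pages are mapped into the caller's
 * address space. The memcpy() is here only so both helpers return their
 * result the same way; real mmap() users would read through `p` directly. */
ssize_t prefix_via_mmap(const char *path, void *buf, size_t len)
{
    assert(buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t n = (size_t)st.st_size < len ? (size_t)st.st_size : len;

    void *p = mmap(NULL, n, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    memcpy(buf, p, n);
    munmap(p, n);
    return (ssize_t)n;
}
```

Calling either helper with a 64-byte buffer on "test.txt" yields the same 64 bytes; what differs is where the copy happens and which memory ends up in the process's address space.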

I/O performance is a complex subject. What's fastest in one case may not be even remotely the fastest in another. Everything from CPU speed to memory bandwidth to PCI bus bandwidth to disk controller characteristics to SATA/SAS/SCSI/FC/iSCSI bandwidth and latency to actual physical disk performance specifics matters. How data is laid out on disk(s) matters. How data is accessed matters. It's pretty much impossible to make a blanket statement like "mmap() is faster than read()" - or the other way around.

Think of getting the best I/O performance as similar to impedance matching speakers on a high-end stereo system for the best sound, but with a whole lot more variables affecting the "best" answer. To get the absolute best performance, everything has to match - from the actual layout of data on physical disk(s) to the exact pattern(s) the user-space applications use to access the data.

And in general it's really not worth bothering with - almost every out-of-the-box setup will get you at least 80% or so of the maximum possible performance your hardware can deliver as long as you don't do something bad like read a file in reverse a single character at a time.
