Rockybilly

Reputation: 4510

Excessive thread count yields better results with file reading

I have a hundred million files, and my program reads all of them at every startup. I have been looking for ways to make this process faster, and along the way I've encountered something strange. My CPU has 4 physical cores, but reading this many files with far higher thread counts yields much better results. This is interesting, given that spawning more threads than the CPU's logical core count should, in theory, be somewhat pointless.

8     Threads: 29.858 s
16    Threads: 15.882 s
32    Threads: 9.989 s
64    Threads: 7.965 s
128   Threads: 8.275 s
256   Threads: 8.159 s
512   Threads: 8.098 s
1024  Threads: 8.253 s
4096  Threads: 8.744 s
16001 Threads: 10.033 s

Why might this occur? Is it some disk bottleneck?

Upvotes: 3

Views: 189

Answers (1)

Brendan

Reputation: 37232

Why might this occur?

If you open one file at "/a/b/c/d/e" then read one block of data from the file; the OS may have to fetch directory info for "/a", then fetch directory info for "/a/b", then fetch directory info for "/a/b/c", then... It might add up to a total of 6 blocks fetched from disk (5 blocks of directory info then one block of file data), and those blocks might be scattered all over the disk.

If you open 100 million files and read one block of file data from each, then this might involve fetching 600 million things (500 million pieces of directory info, and 100 million pieces of file data).

What is the optimal order to do these 600 million things?

Often there are directory info caches and file data caches involved (and all requests that can be satisfied by data that's already cached should be done ASAP, before that data is evicted from the cache(s) to make room for other data). Often the disk hardware also has rules (e.g. it's faster to access all blocks within the same "group of disk blocks" before switching to the next "group of disk blocks"). Sometimes there's parallelism in the disk hardware (e.g. two requests from the same zone can't be done in parallel, but two requests from different zones can be done in parallel).

The optimal order to do these 600 million things is something the OS can figure out.

More specifically; the optimal order to do these 600 million things is something the OS can figure out; if and only if the OS actually knows about all of them.

If you have (e.g.) 8 threads that send one request (e.g. to open a file) and then block (using no CPU time) until the pending request completes; then the OS will only know about a maximum of 8 requests at a time. In other words; the operating system's ability to optimize the order that file IO requests are performed is constrained by the number of pending requests, which is constrained by the number of threads you have.

Ideally; a single thread would be able to ask the OS "open all the files in this list of a hundred million files" so that the OS can fully optimize the order (with the least thread management overhead). Sadly, most operating systems don't support anything like this (e.g. POSIX asynchronous IO fails to support any kind of "asynchronous open").

Having a large number of threads (that are all blocked and not using any CPU time while they wait for their request(s) to actually be done by the file system and/or disk driver) is the only way to improve the operating system's ability to optimize the order of IO requests.

Upvotes: 4
