There seems to be a consensus on Stack Overflow that if one reads a large file in full, sequential reading is fastest and multi-threading the reads is unlikely to give a benefit (e.g., 1, 2, and several more).
Yet in the code example below, multi-threaded reading is actually faster, and by a lot (I have seen 2x, and up to 3x with 1000 GB files). Why is that?
sequential: 41s
parallel: 27s
I am reading from a Samsung SSD 990 PRO 4TB on a 56-core Xeon w9-3495X system. When reading sequentially, the SSD active time is around 75%, so it is somewhat understandable that multi-threading can achieve higher rates. But why is the SSD active time not at 100% to begin with?
I also noticed that the CPU load of the process is 2% with 1 thread and 7% with 4 threads, both of which are close to 100% / 56 * nThreads, so maybe that is already the answer. Still, what keeps the CPU so busy during std::filebuf::sgetn? And is there a faster way to read the file that would improve single-threaded read performance as well? (One candidate is sketched after the listing below.)
#include <chrono>
#include <fstream>
#include <ios>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

// Create the test file with: fsutil file createnew 100GB 100000000000
constexpr auto filename = "100GB";
constexpr auto bufferSize = 6'000'000;
constexpr auto nThreads = 4;

// Runs the callback and prints its wall-clock duration in seconds.
template<typename Callback>
void timeit(const char * message, const Callback & callback) {
    using namespace std::chrono;
    std::cout << message << ": ";
    const auto start = high_resolution_clock::now();
    callback();
    std::cout << duration_cast<seconds>(high_resolution_clock::now() - start) << std::endl;
}

// Each thread reads every nThreads-th chunk of the file, starting at chunk
// iThread, so together the threads cover the whole file exactly once.
static void readFile(const size_t nThreads = 1, const size_t iThread = 0) {
    std::filebuf file;
    file.open(filename, std::ios::in | std::ios::binary);
    const auto buffer = std::make_unique_for_overwrite<char[]>(bufferSize);
    if (iThread > 0) {
        file.pubseekoff(iThread * bufferSize, std::ios_base::cur);
    }
    while (file.sgetn(buffer.get(), bufferSize)) {
        if (nThreads > 1) {
            // Skip over the chunks owned by the other threads.
            file.pubseekoff((nThreads - 1) * bufferSize, std::ios_base::cur);
        }
    }
}

int main() {
    timeit("sequential", [] { readFile(); });
    timeit("parallel", [] {
        std::vector<std::jthread> threads;  // jthreads join on destruction
        for (int iThread = 0; iThread < nThreads; iThread++) {
            threads.emplace_back(readFile, nThreads, iThread);
        }
    });
}
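For reference, here is a minimal sketch of what a lower-level single-threaded read could look like, assuming Windows (given the fsutil command above) and going through the Win32 API directly with unbuffered I/O. FILE_FLAG_NO_BUFFERING requires sector-aligned buffer sizes, offsets, and addresses, so the 6'000'000-byte buffer is replaced by an aligned 4 MiB one:

#include <windows.h>
#include <malloc.h>
#include <cstdio>

int main() {
    // NO_BUFFERING bypasses the page cache and the extra kernel-to-user copy;
    // SEQUENTIAL_SCAN hints the access pattern to the OS.
    HANDLE file = CreateFileA("100GB", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;
    constexpr DWORD alignedBufferSize = 4 * 1024 * 1024;        // multiple of the sector size
    void * buffer = _aligned_malloc(alignedBufferSize, 4096);   // sector-aligned address
    unsigned long long total = 0;
    DWORD bytesRead = 0;
    while (ReadFile(file, buffer, alignedBufferSize, &bytesRead, nullptr) && bytesRead > 0) {
        total += bytesRead;  // consume the chunk here
    }
    std::printf("read %llu bytes\n", total);
    _aligned_free(buffer);
    CloseHandle(file);
}

Whether this actually beats std::filebuf here is something I have not verified; it just takes the library-level buffering out of the picture.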
Upvotes: 2
Views: 315
There are multiple possible bottlenecks: the drive itself, the bus and controller it sits behind, and the CPU that issues the requests and copies the data around.
If a process triggers a read operation, you have the following steps:
1. The process issues a read call, which the CPU turns into a request and hands to the drive.
2. The drive fetches the requested data and transfers it into memory.
3. The CPU copies the data from the kernel and library buffers into the buffer of the process, which then issues the next read.
Looking at these repeating steps, you will find that at any moment some resource is idle, waiting for one or two of the others to complete their task. That means none of these resources is 100% used, which strongly suggests room for improvement. Doing multiple reads in parallel simply lets the steps of different requests overlap, making better use of those resources and increasing throughput.
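As a minimal sketch of that overlap (my illustration, not benchmarked; error handling omitted): a single prefetch thread keeps the drive busy fetching the next chunk while the current one is consumed, without introducing any seeking:

#include <fstream>
#include <future>
#include <memory>
#include <utility>

constexpr std::streamsize chunkSize = 6'000'000;

int main() {
    std::ifstream file("100GB", std::ios::binary);
    auto a = std::make_unique<char[]>(chunkSize);
    auto b = std::make_unique<char[]>(chunkSize);

    auto readChunk = [&file](char * dst) {
        file.read(dst, chunkSize);
        return file.gcount();
    };

    std::streamsize n = readChunk(a.get());
    while (n > 0) {
        // Start fetching the next chunk into b while this thread processes a;
        // the drive's work (step 2) now overlaps the CPU's work (step 3).
        auto next = std::async(std::launch::async, readChunk, b.get());
        // ... consume a[0..n) here ...
        n = next.get();
        std::swap(a, b);
    }
}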
For HDDs, accessing different parts of the disk meant that the magnetic heads had to move mechanically and then wait until the platters had spun to the correct position. Operating systems optimized for this by laying out files so that consecutive blocks of a file sat close together on disk, reducing this seek time. That only helps sequential reads, not the random access pattern produced by multiple threads consuming different parts of the file.
For SSDs, there is zero or negligible seek time, so they handle the staggered reads from multiple threads much better. Many guidelines on optimizing disk operations therefore stem from the pre-SSD era and need to be taken with a grain of salt. Check when something was written and try to understand whether it is aimed at reducing seek operations before blindly applying it to an SSD-based system.
Upvotes: 4