J T

Reputation: 5136

Process same file in two threads using ifstream

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads, each with a separate and distinct ifstream reading from its own unique file offset in the same file. I can't just open one ifstream and then make a copy of it using its copy constructor (since it's noncopyable). So, how do I handle this?

Immediately I can think of two ways,

  1. Construct a new ifstream for the second thread, open it on the same file.
  2. Share a single instance of an open ifstream across both threads (using, for instance, boost::shared_ptr<>), and seek to the file offset the current thread is interested in whenever that thread gets a time slice.

Is one of these two methods preferred?

Is there a third (or fourth) option that I have not yet thought of?

Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.

Thanks.

Upvotes: 7

Views: 11836

Answers (5)

James Kanze

Reputation: 153919

It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so it should definitely be avoided.

It might be worth experimenting to see how read-ahead works on your system: open the file, then read the first half of it sequentially, and see how long that takes. Then open it, seek to the middle, and read the second half sequentially. (On some systems I've seen in the past, a simple seek, at any time, will turn off read-ahead.) Finally, open it, then read every other record; this will simulate two threads using the same file descriptor. (For all of these tests, use fixed-length records, and open in binary mode. Also take whatever steps are necessary to ensure that any data from the file is purged from the OS's cache before starting the test—under Unix, copying a file of 10 or 20 gigabytes to /dev/null is usually sufficient for this.)

That will give you some ideas, but to be really certain, the best solution would be to test the real cases. I'd be surprised if sharing a single ifstream (and thus a single file descriptor), and constantly seeking, won, but you never know.

I'd also recommend system specific solutions like mmap, but if you've got that much data, there's a good chance you won't be able to map it all in one go anyway. (You can still use mmap, mapping sections of it at a time, but it becomes a lot more complicated.)

Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)

Upvotes: 1

Nicolas

Reputation: 1116

My vote would be a single reader, which hands the data to multiple worker threads.

If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

Upvotes: 0

Ben Voigt

Reputation: 283634

Other option:

  • Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this; istringstream is not.)

Upvotes: 4

Cory Nelson

Reputation: 29981

Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.

If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.

Upvotes: 13

Billy ONeal

Reputation: 106549

Between the two, I would prefer the second. Having two separate openings of the same file might give you an inconsistent view of its contents, depending on the underlying OS.

For a third option, pass a reference or raw pointer to the stream into the other thread. So long as the semantics are that one thread "owns" the istream, a raw pointer or reference is fine.

Finally, note that on the vast majority of hardware, the disk is the bottleneck, not the CPU, when loading large files. Using two threads will make this worse, because you're turning sequential file access into random access. Typical hard disks can do maybe 100 MB/s sequentially, but top out at 3 or 4 MB/s for random access.

Upvotes: 7
