Manjunath Ballur

Reputation: 6343

Handling multiple std::async calls

I have a requirement, where I need to delete thousands of files efficiently. At present, files are deleted in a sequential manner.

I want to speed up the deletions, by calling delete in an asynchronous manner, using std::async().

Current Flow:

  1. Get the list of files
  2. For each file call delete()

Desired Flow:

  1. Get the list of files
  2. For each file:
    1. Call AsyncDelete() using std::async()
    2. Store the future object in a vector
  3. Wait for each of the deletes to be completed and then return

I will launch each of the async tasks using std::launch::async, so that it runs on a separate thread.

I have the following questions:

  1. Is async() suited for workloads involving multiple tasks? Or is it better to use threads for such tasks? I read a chapter (Item 35: Prefer task-based programming to thread-based) in Scott Meyers's book "Effective Modern C++", where he recommends task-based programming over thread-based.

  2. How costly is each async() call? Does it have overhead comparable to thread creation? I am planning to control the number of async tasks launched per cycle. For example, if 10,000 files are to be deleted, I will issue just 100 deletes per cycle instead of spawning 10,000 async() tasks in one go. I hope the standard library implementation handles multiple async calls efficiently (for example, by using a thread pool).

  3. The future object returned by async() exposes both get() and wait() methods. I read that get() internally calls wait(). Is it enough to call get() on each of the futures stored in the vector?

  4. What if get() never returns? Is it advisable to use wait_for() with a timeout?
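For question 4, a timed wait can be sketched like this. The 5-second limit is an arbitrary illustrative choice, and note that a timeout does not cancel the underlying task:

```cpp
#include <chrono>
#include <future>

// Sketch: wait on a future with a timeout instead of blocking indefinitely.
bool WaitWithTimeout(std::future<bool>& f)
{
    using namespace std::chrono_literals;
    if (f.wait_for(5s) == std::future_status::ready)
    {
        f.get();      // safe: the result is available, so this will not block
        return true;
    }
    return false;     // timed out; the task is still running in the background
}
```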

Upvotes: 0

Views: 1333

Answers (3)

user2209008

Reputation:

The bottleneck is the I/O and OS-level file system operations; delegating thousands of threads to this is unlikely to alleviate it. In fact, you are likely to find that this approach actually slows things down.

As others have mentioned, depending on the size of the files, it might be better to store the data in an internal database rather than abusing the file system.

Otherwise, I'd probably recommend using one thread for file deletion, then you can just wait (or not wait) for the thread to complete.


To answer one of your questions about how costly async is: the implementation of std::async is compiler- and OS-specific, and its overhead is comparable to that of the native threading implementation on your machine. Really, the best thing to do is to benchmark it yourself.
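A minimal sketch of such a benchmark, comparing the launch-and-join overhead of std::async against raw std::thread on no-op tasks (the helper names and the task count are illustrative; results are machine-specific):

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Time an arbitrary callable in milliseconds using a monotonic clock.
template <typename F>
long long TimeMs(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
}

// Launch n no-op async tasks and wait for all of them.
long long BenchAsync(int n)
{
    return TimeMs([n] {
        std::vector<std::future<void>> futures;
        for (int i = 0; i < n; ++i)
            futures.push_back(std::async(std::launch::async, [] {}));
        for (auto& f : futures) f.get();
    });
}

// Launch n no-op threads and join all of them.
long long BenchThreads(int n)
{
    return TimeMs([n] {
        std::vector<std::thread> threads;
        for (int i = 0; i < n; ++i)
            threads.emplace_back([] {});
        for (auto& t : threads) t.join();
    });
}
```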

Upvotes: 1

As a completely different approach, have you considered moving everything into a database? Deleting thousands of persistent things quickly is just the sort of stuff databases are good at.

Upvotes: 1

You may find this doesn't actually help as much as you would like. The file system is likely to have kernel-level locks (to ensure consistency), and having many threads hitting these locks is likely to cause trouble.

I suggest:

  1. Get the list of files.
  2. Divide the list into (say) ten equal chunks (represented by iterator pairs).
  3. Launch ten threads which each delete their own chunk of the list.
  4. Wait for the ten threads to finish.
  5. Experiment with different values of ten.

Upvotes: 3
