Reputation: 25291
I have a bunch of files (on the order of 10 per second) coming into a system, where each file is stored in a database. Each file contains entries for somewhere between 1 and 500 devices, and a given device will appear in multiple files (but not every file). This data eventually needs to end up in another database, organized per device. There are two different file formats.
There is an API that takes care of the final database step: it accepts several entries for a single device at once. Behind the scenes it also does some lookups to find IDs in that database, so processing multiple entries at once for a single device means doing those lookups once, instead of once per entry.
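For context, here is a minimal sketch of the per-device grouping that makes the batched API call worthwhile. Python is an assumption (the question names no language), and `push_device_entries` is a hypothetical stand-in for the real API:

```python
from collections import defaultdict

def flush_to_final_db(entries, push_device_entries):
    """Group parsed entries by device so the API's ID lookups run
    once per device instead of once per entry.

    `entries` is an iterable of (device_id, entry) pairs;
    `push_device_entries` stands in for the real batched API call.
    """
    by_device = defaultdict(list)
    for device_id, entry in entries:
        by_device[device_id].append(entry)

    for device_id, device_entries in by_device.items():
        # One call per device: the ID lookups are amortized
        # over all of that device's entries.
        push_device_entries(device_id, device_entries)
```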
To do this, I have a program with several parts.
My question is: what is the best way to manage when to parse files, how many threads to use, how much RAM, and so on?
In other words, how do I know when to parse files so that throughput is as high as possible, without hurting performance by using too much RAM?
Upvotes: 3
Views: 317
Reputation: 273274
It seems like you have a system that is very much I/O bound (files on the input side and the DB on the output side). I don't see any CPU-intensive parts in there.
The obvious optimization is already in the question: batch a whole lot of incoming files and group the data per device. The cost is memory consumption and added latency in DB updates, so you'll need tunable parameters for both.
As a first idea, I would set it up as 3 blocks connected by bounded queues. Those queues let any component that is overwhelmed throttle its suppliers.
Block 1: 1 or 2 threads (depending on the I/O system) to read and parse files.
Block 2: 1 thread to organize and group data, and to decide when device data should go to the DB.
Block 3: 1+ threads pushing data to the DB.
The blocks give this system some flexibility, and the bounded queues let you control resource consumption. Note that block 2 should be parametrized so you can tune the batch size; a rough sketch follows below.
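A minimal sketch of that pipeline in Python (an assumption, since the question names no language). The `maxsize` values are the throttling parameters: a full queue blocks the upstream block, which is exactly the backpressure described above. `parse_file`, `incoming_paths`, and `push_to_db` are hypothetical stand-ins:

```python
import threading
from collections import defaultdict
from queue import Queue

file_q = Queue(maxsize=1000)  # block 1 -> block 2: parsed (device_id, entry) pairs
db_q = Queue(maxsize=50)      # block 2 -> block 3: per-device batches for the DB

def reader(incoming_paths, parse_file):
    """Block 1: read and parse files; put() blocks when file_q is full."""
    for path in incoming_paths:
        for device_id, entry in parse_file(path):
            file_q.put((device_id, entry))  # throttled if block 2 falls behind

def grouper(batch_size=500):
    """Block 2: group entries per device; flush a device once its batch is full.

    batch_size is the tunable block-size parameter: larger batches mean
    fewer DB lookups but more RAM and more latency. A real version would
    also flush on a timer to bound latency for quiet devices.
    """
    pending = defaultdict(list)
    while True:
        device_id, entry = file_q.get()
        pending[device_id].append(entry)
        if len(pending[device_id]) >= batch_size:
            db_q.put((device_id, pending.pop(device_id)))

def writer(push_to_db):
    """Block 3: push grouped batches to the DB; run 1+ of these threads."""
    while True:
        device_id, batch = db_q.get()
        push_to_db(device_id, batch)

# Wire up with, e.g.:
#   threading.Thread(target=grouper, daemon=True).start()
#   threading.Thread(target=writer, args=(push_to_db,), daemon=True).start()
```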
Upvotes: 1
Reputation: 39695
This is how I would do it. As each new file comes in, add it to a queue. Have a dispatcher pick up a file and start a new thread.
The dispatcher can constantly monitor available system memory and CPU usage (using, for example, the performance counter API).
As long as there is enough free memory and a low enough CPU load, launch a new thread. You will have to experiment a bit to find the optimal thresholds for your application.
Also, if you are running a 32-bit process, it can only use around 800 MB of RAM before you get an out-of-memory exception, so you might need to take that into consideration as well.
Your third factor for starting new work is the DB API: as long as it can swallow the added work, keep adding threads.
The flow of the program would be something like this:
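A minimal sketch of that flow, assuming Python with the psutil package as a stand-in for the performance counter API (the thresholds and names here are illustrative, not prescriptive):

```python
import threading
import time
from queue import Queue

import psutil  # assumed stand-in for the performance counter API

file_q = Queue()  # incoming files are enqueued here as they arrive

MAX_CPU_PERCENT = 80               # illustrative thresholds; tune per application
MIN_FREE_BYTES = 500 * 1024 ** 2   # keep at least ~500 MB free
MAX_THREADS = 32                   # hard cap as a safety net

def dispatcher(process_file):
    """Pick files off the queue and start a worker thread for each,
    as long as memory and CPU headroom allow."""
    workers = []
    while True:
        workers = [t for t in workers if t.is_alive()]
        cpu_ok = psutil.cpu_percent(interval=0.5) < MAX_CPU_PERCENT
        mem_ok = psutil.virtual_memory().available > MIN_FREE_BYTES
        # A third check could ask the DB API whether it can accept more work.
        if cpu_ok and mem_ok and len(workers) < MAX_THREADS:
            path = file_q.get()  # blocks until a file is available
            t = threading.Thread(target=process_file, args=(path,), daemon=True)
            t.start()
            workers.append(t)
        else:
            time.sleep(0.5)  # back off until resources free up
```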
Upvotes: 0