Roshan

Reputation: 645

How can I implement multithreading in Java to process 2 million text files?

I have to process around 2 million text files and generate their triples.

Suppose I have a text file xyz.txt (one of the 2 million input files). It is processed as follows:

start(xyz.txt)---->module1(xyz.tpd)------>module2(xyz.adv)-------->module3(xyz.tpl)

Please suggest a logic or concept that lets me process the files faster and in an optimized way on x64 4GB Windows systems.

module1 (working): it parses the txt file using a .bat file in which the parser is invoked. This runs as a separate system thread, and after 15 seconds it starts parsing the next txt file, and so on.

module2 (working): it accepts a .tpd file as input and generates an .adv file.

module3 (working): it accepts an .adv file as input and generates a .tpl file (the triples).

Should I start the threads at the txt file stage or at some other point? I am afraid that the CPU will get stuck in context switching.

Does anyone have a better approach that I can try?

Upvotes: 4

Views: 1503

Answers (5)

Zim-Zam O'Pootertoot

Reputation: 18148

As a starting point, I would create one IO thread and a pool of CPU threads. The IO thread reads in the text files and offers them to a BlockingQueue, while the CPU threads take the files from the BlockingQueue and process them.

Then profile the application to see how many CPU threads you need to keep pace with the IO thread. You can also determine this dynamically, e.g. start with one CPU thread and start another whenever the size of the BlockingQueue exceeds a threshold, probably something along the lines of 20 files.

You may find that a single CPU thread is enough to keep pace with the IO thread, in which case your program is IO bound. To speed it up you would then need to, for example, place the text files next to each other on disk (so that all but the first file can be read sequentially) or put them on separate disks. Another idea is to zip the files together and read them in with a ZipInputStream: this reduces the number of disk seeks when reading the files and also reduces the amount of data you need to read.
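The reader/worker split described above could be sketched roughly as follows. This is only an illustration under some assumptions: the class name `ReaderWorkerPipeline`, the `Loaded` holder, the `process` placeholder and the queue capacity of 20 are all made up for the example, the calling thread plays the role of the IO thread, and files are assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class ReaderWorkerPipeline {
    // Hypothetical holder for one file's path and contents.
    static final class Loaded {
        final Path path;
        final String text;
        Loaded(Path path, String text) { this.path = path; this.text = text; }
    }

    // Sentinel ("poison pill") that tells a worker to stop.
    private static final Loaded POISON = new Loaded(null, null);

    /** Reads files on the calling thread, processes them on `workers` threads; returns the number processed. */
    static int run(List<Path> files, int workers) throws InterruptedException {
        // Bounded queue: the reader blocks when the workers fall behind.
        BlockingQueue<Loaded> queue = new ArrayBlockingQueue<>(20);
        AtomicInteger processed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    for (Loaded item = queue.take(); item != POISON; item = queue.take()) {
                        process(item);
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }

        // IO side: read each file and hand it to the workers.
        for (Path p : files) {
            try {
                String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                queue.put(new Loaded(p, text));
            } catch (IOException e) {
                System.err.println("Skipping " + p + ": " + e.getMessage());
            }
        }

        // One sentinel per worker, then wait for all of them to finish.
        for (int i = 0; i < workers; i++) queue.put(POISON);
        for (Thread t : threads) t.join();
        return processed.get();
    }

    // Placeholder for the real module1 -> module2 -> module3 work.
    static void process(Loaded item) {
        item.text.length();
    }
}
```

If the queue is almost always empty while this runs, the reader is the limiting side and extra workers are unlikely to help; if it is almost always full, more workers may.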

Upvotes: 0

Nitin Tripathi

Reputation: 491

This sounds like a typical batch application of the kind needed for data integration. I do not want to throw hyperlinks at you without completely understanding your needs, but you probably want a solution that works in a single VM at first and that you can extend to multiple VMs or machines over time. We are also probably not dealing with petabytes of data to start with. Try Spring Batch: not only will it solve the problem in the given context, it will also teach you to structure your thoughts (think vocabulary!) when solving similar problems.

Upvotes: 0

fge

Reputation: 121720

You have not said much about your hardware environment, but the basic solution would be to use a fixed-size ExecutorService whose size is, at first, the number of your execution units:

private static final int NR_CPUS = Runtime.getRuntime().availableProcessors();

// Then:

final ExecutorService executor = Executors.newFixedThreadPool(NR_CPUS);

Then, for each file, you can create a Runnable to process it, and submit it to the thread pool using its .execute() method.

Note that .execute() is asynchronous; if the submitted runnable cannot be run right now, it will be queued.
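Putting those pieces together, a minimal sketch might look like the following. The class name `FixedPoolRunner` and the `perFile` callback are placeholders invented for this example (the callback stands in for the real per-file processing); the shutdown and wait at the end are one common way to block until all submitted tasks have finished.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class FixedPoolRunner {
    /** Runs `perFile` once per file on a pool sized to the CPU count; returns how many succeeded. */
    static int processAll(List<Path> files, Consumer<Path> perFile) throws InterruptedException {
        int nrCpus = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(nrCpus);
        AtomicInteger done = new AtomicInteger();

        for (Path file : files) {
            // execute() returns immediately; the task waits in the pool's queue if all threads are busy.
            executor.execute(() -> {
                try {
                    perFile.accept(file);
                    done.incrementAndGet();
                } catch (RuntimeException e) {
                    // One bad file should not stop the rest of the batch.
                    System.err.println("Failed on " + file + ": " + e);
                }
            });
        }

        executor.shutdown(); // accept no new tasks, but finish the queued ones
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return done.get();
    }
}
```

Note that the pool returned by `Executors.newFixedThreadPool` uses an unbounded queue, so submitting all 2 million tasks up front keeps 2 million small task objects in memory until they run.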

Upvotes: 1

Dariusz

Reputation: 22251

Most importantly, you have to write the program, profile it, and see where the bottleneck is. It is quite likely that disk I/O will be the bottleneck, in which case no amount of multithreading will solve your problem.

In that case, using two (three? four?) separate hard drives may yield a bigger speed gain than the best multithreaded solution.

Furthermore, the general rule is that you should optimize your application only when you have working code and you really know what to optimize. Profile, profile, profile.

Taking the future multithreaded optimizations into account when writing is OK; the architecture should be flexible enough to allow for future optimizations.
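Before reaching for a full profiler, a crude first measurement is to accumulate wall-clock time per stage and compare the totals. The sketch below is only one way to do that; `StageTimer` and the stage names are invented for this example, and the class is meant for single-threaded use (it is not thread-safe).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

class StageTimer {
    // Total elapsed nanoseconds per stage, in the order stages were first seen.
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs `work`, adds its elapsed time to the total for `stage`, and returns its result. */
    <T> T time(String stage, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    Map<String, Long> totals() {
        return nanosByStage;
    }
}
```

For example, wrapping the file read in `timer.time("read", ...)` and each module in its own stage shows whether reading or processing dominates; if "read" accounts for most of the total, adding threads is unlikely to help much.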

Upvotes: 4

Sumit Desai

Reputation: 1760

Use a ThreadPoolExecutor. Tune its parameters, such as the number of active threads, to suit your environment and system.
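As one possible starting configuration (an assumption, not the only sensible choice), the sketch below builds a fixed-size pool with a bounded queue and the standard `CallerRunsPolicy`, so that when the queue is full the submitting thread runs the task itself instead of queuing millions of tasks. The class name `TunedPool` and the parameter values are placeholders to tune.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class TunedPool {
    /** Fixed-size pool with a bounded work queue; overflow tasks run on the submitting thread. */
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads,                      // core pool size
                threads,                      // maximum pool size (same value, so the pool is fixed-size)
                0L, TimeUnit.MILLISECONDS,    // keep-alive for idle threads above the core size (none here)
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

A call such as `TunedPool.create(Runtime.getRuntime().availableProcessors(), 1000)` would give one thread per core and at most 1000 waiting tasks; both numbers are guesses to adjust after measuring.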

Upvotes: 4
