YK S

Reputation: 3440

Efficient Approach to Process a Large Set of Data in Java

I have a billion records which are unsorted and unrelated to each other, and I have to call a function processRecord on each record using Java.

The easy way to do so is a for loop, but it takes a lot of time.

The other way I could think of is multithreading, but the question is how to divide the dataset of records efficiently, and among how many threads.

Is there an efficient way to process this large dataset?

Upvotes: 1

Views: 4132

Answers (2)

Bartosz Bilicki

Reputation: 13235

Measure

Before choosing an implementation path, measure how long it takes to process a single item. Based on that, choose the size of the work chunk submitted to a thread pool, queue, or cluster. Very small chunks increase coordination overhead. Very large chunks take a long time to process, so progress reporting becomes coarser.
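To make the measuring step concrete, here is a minimal sketch of timing a sample of records and deriving a chunk size from the result. The class name `ChunkSizing`, both method names, and the idea of a target chunk duration are assumptions for illustration, and `processRecord` is passed in as a `Consumer` because its real signature is not shown in the question. Note that the warm-up pass calls `processRecord` on the sample twice, so this only suits a function that is safe to repeat.

```java
import java.util.List;
import java.util.function.Consumer;

final class ChunkSizing {

    /** Average nanoseconds per record over the sample, measured after one warm-up pass. */
    static <T> double averageNanosPerRecord(List<T> sample, Consumer<? super T> processRecord) {
        if (sample.isEmpty()) {
            throw new IllegalArgumentException("sample must not be empty");
        }
        // Warm-up pass so the JIT has a chance to compile the hot path before timing.
        for (T record : sample) {
            processRecord.accept(record);
        }
        long start = System.nanoTime();
        for (T record : sample) {
            processRecord.accept(record);
        }
        return (System.nanoTime() - start) / (double) sample.size();
    }

    /** Records per chunk so that one chunk takes roughly targetChunkMillis; always at least 1. */
    static long chunkSize(double nanosPerRecord, long targetChunkMillis) {
        double targetNanos = targetChunkMillis * 1_000_000.0;
        // Guard against a zero or sub-nanosecond measurement to avoid dividing by ~0.
        return Math.max(1L, (long) (targetNanos / Math.max(nanosPerRecord, 1.0)));
    }
}
```

For example, if a record takes about 1 microsecond (1000 ns) and each chunk should run for roughly 100 ms, `ChunkSizing.chunkSize(1000.0, 100)` gives 100,000 records per chunk. A hand-rolled timing loop like this is only a rough estimate; a benchmarking harness would give more reliable numbers.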

Single-machine processing is simpler to implement, troubleshoot, maintain, and reason about.

Processing on single machine

Use java.util.concurrent.ExecutorService. Submit each work chunk using the submit(Callable<T> task) method: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#submit-java.util.concurrent.Callable-

Create an instance of ExecutorService using java.util.concurrent.Executors.newFixedThreadPool(int nThreads). Choose a reasonable value for nThreads; the number of CPU cores is a reasonable starting value. You may use more threads if processing includes blocking I/O calls (database, HTTP).
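The single-machine approach could look roughly like the sketch below: the records are split into chunks, each chunk is submitted as a `Callable`, and the futures are awaited. The class name `ChunkedProcessor`, the `processAll` method, and the in-memory `List` are assumptions for illustration; a billion records would more likely be read chunk by chunk from a file or database rather than held in one list. `processRecord` is passed as a `Consumer` and must be safe to call from several threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ChunkedProcessor {

    /** Processes all records in chunks on a fixed thread pool; returns the number processed. */
    static <T> long processAll(List<T> records, int chunkSize, Consumer<? super T> processRecord) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Starting value: one thread per core. Raise it if processRecord blocks on I/O.
        int nThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int from = 0; from < records.size(); from += chunkSize) {
                int to = Math.min(from + chunkSize, records.size());
                List<T> chunk = records.subList(from, to); // a view, not a copy
                Callable<Integer> task = () -> {
                    for (T record : chunk) {
                        processRecord.accept(record);
                    }
                    return chunk.size();
                };
                futures.add(pool.submit(task));
            }
            long processed = 0;
            for (Future<Integer> future : futures) {
                processed += future.get(); // waits for the chunk; surfaces any task failure
            }
            return processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown(); // no new tasks; already submitted ones may still finish
        }
    }
}
```

Summing the chunk sizes returned by the futures also gives a simple place to report progress as each chunk completes.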

Processing on multiple machines - cluster

Submit processing jobs to a cluster processing technology such as Spark, Hadoop, or Google BigQuery.

Processing on multiple machines - queue

You can submit your records to any queue system (Kafka, RabbitMQ, ActiveMQ, etc.), then have multiple machines consume items from the queue. You will be able to add or remove consumers at any time. This approach is fine if you do not need the processing results collected in a single place.

Upvotes: 3

Sergei Nenashev

Reputation: 51

A parallel stream could be used here to process your data in parallel. By default, a parallel stream runs on the common ForkJoinPool, whose parallelism is one less than the number of available processors; the calling thread also takes part in the work.

More detailed and useful information about that can be found here: https://stackoverflow.com/a/21172732/8184084
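As a minimal sketch of the parallel stream approach, assuming the records fit in a `List` and that `processRecord` is safe to call from several threads at once (the class name `ParallelStreamSketch` and both method names are made up for illustration):

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Consumer;

final class ParallelStreamSketch {

    /** Calls processRecord on every record, splitting the work across the common ForkJoinPool. */
    static <T> void processAll(List<T> records, Consumer<? super T> processRecord) {
        // forEach on a parallel stream gives no ordering guarantee, which is fine
        // here because the records are unrelated to each other.
        records.parallelStream().forEach(processRecord);
    }

    /** Target parallelism of the common pool that parallel streams use by default. */
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }
}
```

This trades control for brevity: there is no explicit chunk size or thread count to tune, and blocking I/O inside `processRecord` can tie up the common pool that other parallel streams in the same JVM share.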

Upvotes: 1
