oldsport

Reputation: 993

lambda foreach parallelStream creating less data than expected

I'm trying to implement a lambda forEach over a parallel stream of an ArrayList to improve the performance of an existing application.

So far, the foreach iteration without a parallel stream writes the expected amount of data into the database.

But when I switch to a parallelStream, it always writes fewer rows into the database: out of 10,000 expected rows, only about 7,000 arrive, and the exact number varies from run to run.

Any idea what I am missing here? Is this a data race, and do I have to work with locks and synchronized?

The code basically does something like this:

// Create Persons from an arraylist of data

arrayList.parallelStream()
    .filter(d -> d.personShouldBeCreated())
    .forEach(d -> {
        // Create a Person
        // Fill its properties
        // Update the object, which writes it into the DB
    });

Things I tried so far

Collecting the result in a new List with...

collect(Collectors.toList())

...and then iterating over the new list and executing the logic described in the first code snippet. The size of the new 'collected' ArrayList matches the expected result, but in the end there is still less data created in the database.
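In sketch form, the collecting step looked roughly like this (Data stands in for the actual element type):

List<Data> toCreate = arrayList.parallelStream()
        .filter(d -> d.personShouldBeCreated())
        .collect(Collectors.toList());   // size matches the expected count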

Update/Solution:

Based on the answer I accepted (and the hints in the comments) about the non-thread-safe parts in that code, I implemented it as follows, which finally gives me the expected amount of data. Performance has also improved: it now takes only about a third of the time of the previous implementation.

StringBuffer sb = new StringBuffer(); // StringBuffer is synchronized, so concurrent appends are safe
arrayList.parallelStream()
    .filter(d -> d.toBeCreated())
    .forEach(d ->
        sb.append(
            // Build an application-specific XML fragment for inserting or importing the data
        )
    );

The application-specific part is an XML-based data import API, but I think this could also be done with plain SQL JDBC inserts.
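If it were done with plain JDBC, I imagine collecting the filtered items first and then doing a single batched insert; something like this rough sketch (table, columns, getters and the dataSource are made up, error handling omitted):

List<Data> toCreate = arrayList.parallelStream()
        .filter(d -> d.toBeCreated())
        .collect(Collectors.toList());

String sql = "INSERT INTO person (name, age) VALUES (?, ?)"; // hypothetical table and columns
try (Connection con = dataSource.getConnection();
     PreparedStatement ps = con.prepareStatement(sql)) {
    for (Data d : toCreate) {
        ps.setString(1, d.getName()); // hypothetical getters on the data object
        ps.setInt(2, d.getAge());
        ps.addBatch();
    }
    ps.executeBatch(); // one round trip per batch instead of one per row
}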

Upvotes: 5

Views: 2235

Answers (1)

Valentin Ruano

Reputation: 2809

Most likely the code within your lambda is not thread-safe, because it uses shared non-concurrent data structures or because manipulating them requires locking.
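For example, appending to a shared ArrayList from a parallel forEach is a minimal way to reproduce the lost-updates effect (the numbers vary from run to run):

List<Integer> unsafe = new ArrayList<>();
IntStream.range(0, 10_000).parallel()
         .forEach(unsafe::add);                // racy: ArrayList is not thread-safe
System.out.println(unsafe.size());             // usually < 10000, can even throw

List<Integer> safe = IntStream.range(0, 10_000).parallel()
         .boxed()
         .collect(Collectors.toList());        // the collector merges per-thread results safely
System.out.println(safe.size());               // always 10000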

I suspect a batch/bulk insert is going to be faster than a parallel version, which would probably end up spawning many short-lived connections that compete with each other for locks on the tables you are inserting into.

Perhaps you could gain something by composing the bulk-insert file contents in parallel, though; that depends on how a bulk insert can be realized through your database API. Does the data need to be dumped into a text file first? In that case your parallel stream could compose the different lines of that text in parallel and finally join them into the text file to load into the DB. Perhaps instead of a text file the API lets you pass a collection/list of statement objects held in memory; in that case your parallel stream could create those objects in parallel and collect them into the final collection/list to be bulk-inserted into your DB.
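A rough sketch of the "compose the lines in parallel, join them at the end" idea (toInsertLine is a placeholder for whatever format the bulk loader expects):

String bulkContents = arrayList.parallelStream()
        .filter(d -> d.toBeCreated())
        .map(d -> toInsertLine(d))                             // placeholder: format one row/statement
        .collect(Collectors.joining(System.lineSeparator()));

Files.writeString(Path.of("bulk-import.txt"), bulkContents);  // Java 11+; then load the file into the DB

Only the final join is sequential, so no shared mutable state is needed while the lines are being built.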

Upvotes: 3
