user3001

Reputation: 3487

How to read all lines of a file in parallel in Java 8

I want to read all lines of a 1 GB file into a Stream<String> as fast as possible. Currently I'm using Files.lines(path) for that. After parsing the file, I do some computations (map()/filter()).

At first I thought this was already done in parallel, but it seems I was wrong: reading the file as it is takes about 50 seconds on my dual-CPU laptop. However, if I split the file using bash commands and then process the pieces in parallel, it only takes about 30 seconds.

I tried the following combinations:

  1. single file, no parallel lines() stream ~ 50 seconds
  2. single file, Files.lines(..).parallel().[...] ~ 50 seconds
  3. two files, no parallel lines() stream ~ 30 seconds
  4. two files, Files.lines(..).parallel().[...] ~ 30 seconds

I ran each of these four several times with roughly the same results (within 1 or 2 seconds). The [...] is a chain of map and filter only, with a toArray(...) at the end to trigger the evaluation.
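For reference, the pipeline timed in the four runs above has roughly the following shape. This is only a sketch: the class name LinesPipeline, the method process and the trim/isEmpty steps are hypothetical stand-ins, since the real map and filter functions are not shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class LinesPipeline {
    // Reads the file lazily and runs a map/filter chain over its lines.
    // The trim/isEmpty steps are placeholders for the real computations.
    static String[] process(Path path, boolean parallel) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(path)) {
            Stream<String> stream = parallel ? lines.parallel() : lines;
            return stream
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .toArray(String[]::new); // terminal operation triggers evaluation
        }
    }
}
```

Calling process(path, false) corresponds to combination 1 and process(path, true) to combination 2.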

The conclusion is that lines().parallel() makes no difference. Since reading two files in parallel takes less time, there is a performance gain from splitting the file. However, it seems that a single file is always read serially.

Edit:
I want to point out that I use an SSD, so there is practically no seeking time. The file has 1658652 (relatively short) lines in total. Splitting the file in bash takes about 1.5 seconds:

   time split -l 829326 file # 829326 = 1658652 / 2
   split -l 829326 file  0,14s user 1,41s system 16% cpu 9,560 total

So my question is, is there any class or function in the Java 8 JDK which can parallelize reading all lines without having to split it first? For example, if I have two CPU cores, the first line reader should start at the first line and a second one at line (totalLines/2)+1.
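To make the idea concrete, here is a minimal sketch of what such a two-reader approach could look like when written by hand with FileChannel. It is not an existing JDK facility: the class HalfSplitReader and its methods are hypothetical names. It splits at a byte offset near the middle (moved forward to the next line boundary) rather than at line totalLines/2, because finding a line number would require scanning the file first. It also assumes UTF-8 text and that each half fits into a byte array and into the heap.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class HalfSplitReader {
    // Returns the offset just past the first '\n' at or after 'from',
    // or the file size if there is none.
    static long nextLineStart(FileChannel ch, long from) throws IOException {
        ByteBuffer one = ByteBuffer.allocate(1);
        long size = ch.size();
        for (long pos = from; pos < size; pos++) {
            one.clear();
            ch.read(one, pos); // positional read, does not move the channel position
            if (one.get(0) == '\n') {
                return pos + 1;
            }
        }
        return size;
    }

    // Reads the bytes in [start, end) and splits them into lines.
    static List<String> readRange(FileChannel ch, long start, long end) {
        try {
            ByteBuffer buf = ByteBuffer.allocate((int) (end - start)); // assumes < 2 GB
            long pos = start;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) break; // unexpected end of file
                pos += n;
            }
            String text = new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
            List<String> lines = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
                for (String line; (line = reader.readLine()) != null; ) {
                    lines.add(line);
                }
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads both halves concurrently and returns the lines in file order.
    static List<String> readAllLines(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            long mid = nextLineStart(ch, size / 2);
            CompletableFuture<List<String>> first =
                    CompletableFuture.supplyAsync(() -> readRange(ch, 0, mid));
            CompletableFuture<List<String>> second =
                    CompletableFuture.supplyAsync(() -> readRange(ch, mid, size));
            List<String> all = new ArrayList<>(first.join());
            all.addAll(second.join()); // join before the channel is closed
            return all;
        }
    }
}
```

Splitting on the '\n' byte is safe for UTF-8 because that byte value never occurs inside a multi-byte character. Whether this is actually faster than a single reader would have to be measured.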

Upvotes: 21

Views: 28396

Answers (1)

matthewmatician

Reputation: 332

You might find some help from this post. Trying to parallelize the actual reading of a file is probably barking up the wrong tree, as the biggest slowdown will be your file system (even on an SSD).

If you set up a file channel in memory (for example, by memory-mapping the file), you should be able to process the data in parallel from there with great speed, but chances are you won't need to, as the in-memory access alone should give you a huge speed increase.
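One possible reading of this suggestion, sketched under assumptions: map the file into memory through a FileChannel, decode it into lines, and only then run the parallel stream over the in-memory list. The class MappedFileLines and the trim/isEmpty steps are hypothetical; the sketch assumes UTF-8 text, a file smaller than 2 GB (the limit of a single mapping), and enough heap for the decoded text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;

class MappedFileLines {
    // Maps the whole file read-only and decodes it into a list of lines.
    static List<String> readLines(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // A single mapping cannot exceed Integer.MAX_VALUE bytes.
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            String text = StandardCharsets.UTF_8.decode(mapped).toString();
            try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
                return reader.lines().collect(Collectors.toList());
            }
        }
    }

    // Runs a placeholder map/filter chain in parallel over the in-memory lines.
    static String[] process(Path path) throws IOException {
        return readLines(path).parallelStream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .toArray(String[]::new);
    }
}
```

The point of the design is that the parallel stream starts from a list of known size held in memory instead of from a reader that is consumed sequentially. Whether it beats the 30 second result is something only a measurement on the real file can show.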

Upvotes: 7
