Why does distinct work via flatMap, but not via map's "sub-stream"?

Question

I'm reading in lines of text and creating a list of their unique words (after lowercasing them). I can make this work with flatMap, but can't make it work with a map's "sub" stream. The flatMap version seems more concise and "better", but why does distinct work in one context but not the other?

Class top:

import static java.util.stream.Collectors.toList;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class GetListOfAllWordsInLinesOfText {

   private static final String INPUT = "Line 1\n" +
                              "Line 2, which is a really long line\n" +
                              "A moderately long line 3\n" +
                              "Line 4\n";
   private static final Pattern WORD_SEPARATOR_PATTERN = Pattern.compile("\\W+");

   public static void main(String[] args) {

Why does this distinct allow duplicates through:

      final List<String> wordList = new ArrayList<>();
      Arrays.stream(INPUT.split("\n"))
            .forEach(line -> WORD_SEPARATOR_PATTERN.splitAsStream(line).
                        map(String::toLowerCase).
                        distinct().
                        forEach(wordList::add));

      System.out.println("Output via map:");
      wordList.stream().forEach(System.out::println);

      System.out.println("--------");

Output:

Output via map:
line
1
line
2
which
is
a
really
long
a
moderately
long
line
3
line
4

But this correctly eliminates duplicates?

      final List<String> wordList2 = Arrays.stream(INPUT.split("\n")).flatMap(
            WORD_SEPARATOR_PATTERN::splitAsStream).map(String::toLowerCase).
            distinct()
            .collect(toList());

      System.out.println("Output via flatMap:");
      wordList2.stream().forEach(System.out::println);
   }
}

Output:

Output via flatMap:
line
1
2
which
is
a
really
long
moderately
3
4

Here's the full output, with peek calls added before and after distinct (see the full code further down). You can see the duplicates being correctly filtered by the flatMap version, but not by the map version:

map:

map before distinct -> line
map after distinct -> line
map before distinct -> 1
map after distinct -> 1
map before distinct -> line
map after distinct -> line
map before distinct -> 2
map after distinct -> 2
map before distinct -> which
map after distinct -> which
map before distinct -> is
map after distinct -> is
map before distinct -> a
map after distinct -> a
map before distinct -> really
map after distinct -> really
map before distinct -> long
map after distinct -> long
map before distinct -> line
map before distinct -> a
map after distinct -> a
map before distinct -> moderately
map after distinct -> moderately
map before distinct -> long
map after distinct -> long
map before distinct -> line
map after distinct -> line
map before distinct -> 3
map after distinct -> 3
map before distinct -> line
map after distinct -> line
map before distinct -> 4
map after distinct -> 4
Output via map:
line
1
line
2
which
is
a
really
long
a
moderately
long
line
3
line
4
--------

flatMap:

flatMap before distinct -> line
flatMap after distinct -> line
flatMap before distinct -> 1
flatMap after distinct -> 1
flatMap before distinct -> line
flatMap before distinct -> 2
flatMap after distinct -> 2
flatMap before distinct -> which
flatMap after distinct -> which
flatMap before distinct -> is
flatMap after distinct -> is
flatMap before distinct -> a
flatMap after distinct -> a
flatMap before distinct -> really
flatMap after distinct -> really
flatMap before distinct -> long
flatMap after distinct -> long
flatMap before distinct -> line
flatMap before distinct -> a
flatMap before distinct -> moderately
flatMap after distinct -> moderately
flatMap before distinct -> long
flatMap before distinct -> line
flatMap before distinct -> 3
flatMap after distinct -> 3
flatMap before distinct -> line
flatMap before distinct -> 4
flatMap after distinct -> 4
Output via flatMap:
line
1
2
which
is
a
really
long
moderately
3
4

Full code:

import static java.util.stream.Collectors.toList;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class GetListOfAllWordsInLinesOfText {

   private static final String INPUT = "Line 1\n" +
                              "Line 2, which is a really long line\n" +
                              "A moderately long line 3\n" +
                              "Line 4\n";
   private static final Pattern WORD_SEPARATOR_PATTERN = Pattern.compile("\\W+");

   public static void main(String[] args) {

      final List<String> wordList = new ArrayList<>();
      Arrays.stream(INPUT.split("\n"))
            .forEach(line -> WORD_SEPARATOR_PATTERN.splitAsStream(line).map(String::toLowerCase)
                  .peek(word -> System.out.println("map before distinct -> " + word)).
                        distinct().
                        peek(word -> System.out.println("map after distinct -> " + word)).
                        forEach(wordList::add));

      System.out.println("Output via map:");
      wordList.stream().forEach(System.out::println);

      System.out.println("--------");

      final List<String> wordList2 = Arrays.stream(INPUT.split("\n")).flatMap(
            WORD_SEPARATOR_PATTERN::splitAsStream).map(String::toLowerCase).
                  peek(word -> System.out.println("flatMap before distinct -> " + word)).
            distinct()
                  .peek(word -> System.out.println("flatMap after distinct -> " + word))
            .collect(toList());

      System.out.println("Output via flatMap:");
      wordList2.stream().forEach(System.out::println);
   }
}

Jiri Tousek · Accepted Answer

The first code snippet uses forEach to process each line and calls distinct inside that forEach, on a separate stream created for each line. The duplicates are therefore eliminated, but only within a line, not globally.

See the output for the second line: the repeated occurrence of 'line' is actually eliminated there, since both occurrences are on the same line.
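
A minimal sketch of that difference in scope, using a shortened two-line input. The class name DistinctScopeDemo and the two helper methods are made up for illustration and do not appear in the question. The first method builds a new stream per line, so each distinct() only sees the words of that line. The second flattens all lines into one stream first, so a single distinct() sees every word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DistinctScopeDemo {

    private static final Pattern WORD_SEPARATOR = Pattern.compile("\\W+");

    // One stream per line: distinct() starts with empty state for every line,
    // so a word is dropped only if it repeats within that same line.
    public static List<String> distinctPerLine(String input) {
        List<String> words = new ArrayList<>();
        for (String line : input.split("\n")) {
            WORD_SEPARATOR.splitAsStream(line)
                    .map(String::toLowerCase)
                    .distinct()
                    .forEach(words::add);
        }
        return words;
    }

    // One stream for all lines: flatMap merges the per-line word streams,
    // so the single distinct() call tracks words across the whole input.
    public static List<String> distinctAcrossLines(String input) {
        return Arrays.stream(input.split("\n"))
                .flatMap(WORD_SEPARATOR::splitAsStream)
                .map(String::toLowerCase)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "Line 1\nLine 2, which is a really long line\n";
        // prints [line, 1, line, 2, which, is, a, really, long]
        System.out.println(distinctPerLine(input));
        // prints [line, 1, 2, which, is, a, really, long]
        System.out.println(distinctAcrossLines(input));
    }
}
```

In the per-line result, "line" appears once for each input line, while the second "line" inside line 2 is dropped. In the flattened result it appears once overall.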
