Reputation: 20163
I'm reading in lines of text and creating a list of their unique words (after lowercasing them). I can make this work with flatMap, but can't make it work with a map's "sub" stream. The flatMap version seems more concise and "better", but why does distinct work in one context and not the other?
Class top:
import static java.util.stream.Collectors.toList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public class GetListOfAllWordsInLinesOfText {
private static final String INPUT = "Line 1\n" +
"Line 2, which is a really long line\n" +
"A moderately long line 3\n" +
"Line 4\n";
private static final Pattern WORD_SEPARATOR_PATTERN = Pattern.compile("\\W+");
public static void main(String[] args) {
Why does this distinct allow duplicates through:
final List<String> wordList = new ArrayList<>();
Arrays.stream(INPUT.split("\n"))
    .forEach(line -> WORD_SEPARATOR_PATTERN.splitAsStream(line)
        .map(String::toLowerCase)
        .distinct()
        .forEach(wordList::add));
System.out.println("Output via map:");
wordList.stream().forEach(System.out::println);
System.out.println("--------");
Output:
Output via map:
line
1
line
2
which
is
a
really
long
a
moderately
long
line
3
line
4
But this correctly eliminates duplicates?
final List<String> wordList2 = Arrays.stream(INPUT.split("\n"))
    .flatMap(WORD_SEPARATOR_PATTERN::splitAsStream)
    .map(String::toLowerCase)
    .distinct()
    .collect(toList());
System.out.println("Output via flatMap:");
wordList2.stream().forEach(System.out::println);
}
}
Output:
line
1
2
which
is
a
really
long
moderately
3
4
Here's the full output, including the peeks from the full code below. You can see the duplicates being correctly filtered by the flatMap version, but not the map version:
map:
map before distinct -> line
map after distinct -> line
map before distinct -> 1
map after distinct -> 1
map before distinct -> line
map after distinct -> line
map before distinct -> 2
map after distinct -> 2
map before distinct -> which
map after distinct -> which
map before distinct -> is
map after distinct -> is
map before distinct -> a
map after distinct -> a
map before distinct -> really
map after distinct -> really
map before distinct -> long
map after distinct -> long
map before distinct -> line
map before distinct -> a
map after distinct -> a
map before distinct -> moderately
map after distinct -> moderately
map before distinct -> long
map after distinct -> long
map before distinct -> line
map after distinct -> line
map before distinct -> 3
map after distinct -> 3
map before distinct -> line
map after distinct -> line
map before distinct -> 4
map after distinct -> 4
Output via map:
line
1
line
2
which
is
a
really
long
a
moderately
long
line
3
line
4
--------
flatMap:
flatMap before distinct -> line
flatMap after distinct -> line
flatMap before distinct -> 1
flatMap after distinct -> 1
flatMap before distinct -> line
flatMap before distinct -> 2
flatMap after distinct -> 2
flatMap before distinct -> which
flatMap after distinct -> which
flatMap before distinct -> is
flatMap after distinct -> is
flatMap before distinct -> a
flatMap after distinct -> a
flatMap before distinct -> really
flatMap after distinct -> really
flatMap before distinct -> long
flatMap after distinct -> long
flatMap before distinct -> line
flatMap before distinct -> a
flatMap before distinct -> moderately
flatMap after distinct -> moderately
flatMap before distinct -> long
flatMap before distinct -> line
flatMap before distinct -> 3
flatMap after distinct -> 3
flatMap before distinct -> line
flatMap before distinct -> 4
flatMap after distinct -> 4
Output via flatMap:
line
1
2
which
is
a
really
long
moderately
3
4
Full code:
import static java.util.stream.Collectors.toList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public class GetListOfAllWordsInLinesOfText {
private static final String INPUT = "Line 1\n" +
"Line 2, which is a really long line\n" +
"A moderately long line 3\n" +
"Line 4\n";
private static final Pattern WORD_SEPARATOR_PATTERN = Pattern.compile("\\W+");
public static void main(String[] args) {
final List<String> wordList = new ArrayList<>();
Arrays.stream(INPUT.split("\n"))
    .forEach(line -> WORD_SEPARATOR_PATTERN.splitAsStream(line)
        .map(String::toLowerCase)
        .peek(word -> System.out.println("map before distinct -> " + word))
        .distinct()
        .peek(word -> System.out.println("map after distinct -> " + word))
        .forEach(wordList::add));
System.out.println("Output via map:");
wordList.stream().forEach(System.out::println);
System.out.println("--------");
final List<String> wordList2 = Arrays.stream(INPUT.split("\n"))
    .flatMap(WORD_SEPARATOR_PATTERN::splitAsStream)
    .map(String::toLowerCase)
    .peek(word -> System.out.println("flatMap before distinct -> " + word))
    .distinct()
    .peek(word -> System.out.println("flatMap after distinct -> " + word))
    .collect(toList());
System.out.println("Output via flatMap:");
wordList2.stream().forEach(System.out::println);
}
}
Upvotes: 4
Views: 2853
Reputation: 12440
The first code snippet uses forEach to process each line and calls distinct inside that forEach, so duplicates are eliminated only within each line, not globally.
See the output for the second line: the repeated occurrence of 'line' is actually eliminated, since it is repeated within the same line.
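To make the difference in scope concrete, here is a minimal sketch that isolates the two approaches. The class name DistinctScopeDemo and the method names perLineDistinct and globalDistinct are made up for this illustration and are not from the question. The input is just lines 2 and 3 of the question's INPUT. The per-line version builds a new stream for every line, so each distinct() only ever sees one line's words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DistinctScopeDemo {
    static final Pattern WORD_SEPARATOR_PATTERN = Pattern.compile("\\W+");

    // One stream pipeline per line: every distinct() starts with an empty
    // "seen" set, so duplicates are removed within a line only.
    static List<String> perLineDistinct(String input) {
        List<String> words = new ArrayList<>();
        for (String line : input.split("\n")) {
            WORD_SEPARATOR_PATTERN.splitAsStream(line)
                .map(String::toLowerCase)
                .distinct()
                .forEach(words::add);
        }
        return words;
    }

    // A single pipeline: flatMap merges the words of all lines into one
    // stream, so the one distinct() sees every word.
    static List<String> globalDistinct(String input) {
        return Arrays.stream(input.split("\n"))
            .flatMap(WORD_SEPARATOR_PATTERN::splitAsStream)
            .map(String::toLowerCase)
            .distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "Line 2, which is a really long line\n"
            + "A moderately long line 3\n";
        // [line, 2, which, is, a, really, long, a, moderately, long, line, 3]
        System.out.println(perLineDistinct(input));
        // [line, 2, which, is, a, really, long, moderately, 3]
        System.out.println(globalDistinct(input));
    }
}
```

In the per-line result, the second 'line' of the first input line is gone, but 'a', 'long' and 'line' reappear from the next input line, which matches the question's output.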
Upvotes: 6