Reputation: 91
I'm working with a List<String> that contains a large text. It looks like this:
List<String> lines = Arrays.asList("The first line", "The second line", "Some words can repeat", "The first the second"); //etc
I need to count the words in it and produce output like this:
first - 2
line - 2
second - 2
repeat - 1
some - 1
words - 1
Words shorter than 4 characters should be skipped, which is why "the" and "can" are not in the output. This is only an example; in the real task a word counts as rare if it occurs fewer than 20 times, and rare words should be skipped as well. The map should then be sorted by key in alphabetical order. All of this must be done using only streams, without "if", "while", or "for" constructs.
What I have implemented:
Map<String, Integer> wordCount = Stream.of(lines)
.flatMap(Collection::stream)
.flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
.filter(str -> (str.length() >= 4))
.collect(Collectors.toMap(
i -> i.toLowerCase(),
i -> 1,
(a, b) -> java.lang.Integer.sum(a, b))
);
wordCount now holds a Map of words to their occurrence counts. But how can I skip the rare words? Should I create a new stream? If so, how do I get at the values of the Map? I tried this, but it's not correct:
String result = Stream.of(wordCount)
.filter(i -> (Map.Entry::getValue > 10));
My calculation should return a String in this form:
"word" - number of entries
Thank you!
Upvotes: 0
Views: 996
Reputation: 40034
You can't exclude rare words until you have computed the frequency count.
Here is how I might go about it. The result is collected into a TreeMap to sort the words in lexical order.
List<String> list = Arrays.asList(....);
int wordRarity = 10; // minimum frequency to accept
int wordLength = 4; // minimum word length to accept
Map<String, Long> map = list.stream()
.flatMap(str -> Arrays.stream(
str.split("\\p{Punct}|\\s+|[0-9]|…|«|»|“|„")))
.filter(str -> str.length() >= wordLength)
.collect(Collectors.groupingBy(String::toLowerCase,
Collectors.counting()))
// Here is where the rare words are filtered out.
.entrySet().stream().filter(e -> e.getValue() >= wordRarity)
.collect(Collectors.toMap(Entry::getKey, Entry::getValue,
(a, b) -> a, TreeMap::new));
Note that the (a, b) -> a lambda is a merge function for handling duplicate keys. It is never called here because the keys of an entry set are already unique. Unfortunately, toMap does not let you specify a map Supplier without also specifying the merge function.
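To make the point about the merge function concrete, here is a minimal sketch. The class name ToMapSupplierSketch and the helper toSorted are hypothetical names chosen for illustration. It copies an existing count map into a TreeMap with the four-argument Collectors.toMap, and the merge function throws instead of silently picking a value, so it would be obvious if it were ever called.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ToMapSupplierSketch {

    // Hypothetical helper: copies the entries of a count map into a TreeMap.
    static TreeMap<String, Long> toSorted(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        // Required by this overload, but never invoked here:
                        // the keys of an entry set are already unique.
                        (a, b) -> { throw new IllegalStateException("unexpected duplicate key"); },
                        TreeMap::new));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("second", 2L);
        counts.put("first", 2L);
        counts.put("line", 2L);
        System.out.println(toSorted(counts)); // {first=2, line=2, second=2}
    }
}
```

Whether to throw or to keep (a, b) -> a is a matter of taste; throwing simply documents the assumption that no two entries share a key.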
The easiest way to print them is as follows:
map.entrySet().forEach(e -> System.out.printf("%s - %s%n",
e.getKey(), e.getValue()));
Upvotes: 2
Reputation: 2776
Given the stream you already have:
List<String> lines = Arrays.asList(
"For the rabbit, it was a bad day.",
"An Antillean rabbit is very abundant.",
"She put the rabbit back in the cage and closed the door securely, then ran away.",
"The rabbit tired of her inquisition and hopped away a few steps.",
"The Dean took the rabbit and went out of the house and away."
);
Map<String, Integer> wordCounts = Stream.of(lines)
.flatMap(Collection::stream)
.flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
.filter(str -> (str.length() >= 4))
.collect(Collectors.toMap(
String::toLowerCase,
i -> 1,
Integer::sum)
);
System.out.println("Original:" + wordCounts);
Original output:
Original:{dean=1, took=1, door=1, very=1, went=1, away=3, antillean=1, abundant=1, tired=1, back=1, then=1, house=1, steps=1, hopped=1, inquisition=1, cage=1, securely=1, rabbit=5, closed=1}
You can do:
String results = wordCounts.entrySet()
.stream()
.filter(wordToCount -> wordToCount.getValue() > 2) // counts of 2 or less are treated as rare
.sorted(Map.Entry.comparingByKey())
.map(wordCount -> wordCount.getKey() + " - " + wordCount.getValue())
.collect(Collectors.joining(", "));
System.out.println(results);
Filtered output:
away - 3, rabbit - 5
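If the intermediate map is not needed, the counting and the filtering can probably be folded into a single collect call with Collectors.collectingAndThen. The following is only a sketch under some assumptions: the class WordReportSketch and the method report are hypothetical names, the thresholds are passed as parameters, and the split pattern is simplified to "any run of non-letter characters" rather than the exact pattern from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class WordReportSketch {

    // Hypothetical helper: counts words, drops short and rare ones,
    // and returns the rest as "word - count" pairs in alphabetical order.
    static String report(List<String> lines, int minLength, long minCount) {
        // Grouping straight into a TreeMap keeps the keys sorted.
        Collector<String, ?, TreeMap<String, Long>> countSorted =
                Collectors.groupingBy(word -> word.toLowerCase(), TreeMap::new, Collectors.counting());

        return lines.stream()
                // Assumption: any run of non-letter characters separates words.
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.collectingAndThen(countSorted,
                        counts -> counts.entrySet().stream()
                                .filter(e -> e.getValue() >= minCount)
                                .map(e -> e.getKey() + " - " + e.getValue())
                                .collect(Collectors.joining(", "))));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "For the rabbit, it was a bad day.",
                "An Antillean rabbit is very abundant.",
                "She put the rabbit back in the cage and closed the door securely, then ran away.",
                "The rabbit tired of her inquisition and hopped away a few steps.",
                "The Dean took the rabbit and went out of the house and away.");
        System.out.println(report(lines, 4, 3)); // away - 3, rabbit - 5
    }
}
```

Note that minCount is inclusive here (a word is kept when its count is at least minCount), so 3 corresponds to the "> 2" filter above.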
Upvotes: 2