Catherin Zeta Jones

Reputation: 91

How to count words in Map via Stream

I'm working with a List<String> that contains a large text. The text looks like:

List<String> lines = Arrays.asList("The first line", "The second line", "Some words can repeat", "The first the second"); //etc

I need to calculate words in it with output:

first - 2
line - 2
second - 2
repeat - 1
some - 1
words - 1

Words shorter than 4 characters should be skipped, which is why "the" and "can" are not in the output. The example above is simplified: in the real task, a rare word (one that occurs fewer than 20 times) should also be skipped. The map should then be sorted by key in alphabetical order. All of this must be done using only streams, without "if", "while", or "for" constructs.

What I have implemented:

Map<String, Integer> wordCount = Stream.of(lines)
                .flatMap(Collection::stream)
                .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
                .filter(str -> (str.length() >= 4))
                .collect(Collectors.toMap(
                        i -> i.toLowerCase(),
                        i -> 1,
                        (a, b) -> java.lang.Integer.sum(a, b))
                );

wordCount now holds a Map of words and their number of occurrences. But how can I skip rare words? Should I create a new stream? If so, how do I get at the values of the Map? I tried this, but it's not correct:

 String result = Stream.of(wordCount)
         .filter(i -> (Map.Entry::getValue > 10));

My calculation should return a String:

"word" - number of entries

Thank you!

Upvotes: 0

Views: 996

Answers (2)

WJS

Reputation: 40034

You can't filter out rare words until you have computed the frequency counts.

Here is how I might go about it.

  • do the frequency count (I chose to do it slightly differently than you).
  • then stream the entrySet of the map and filter out values less than a certain frequency.
  • then reconstruct the map using a TreeMap to sort the words in lexical order.

import java.util.*;
import java.util.Map.Entry;
import java.util.stream.Collectors;

List<String> list = Arrays.asList(....);

int wordRarity = 10; // minimum frequency to accept
int wordLength = 4; // minimum word length to accept

Map<String, Long> map = list.stream()
        // split each line into words on punctuation, whitespace and digits
        .flatMap(str -> Arrays.stream(
                str.split("\\p{Punct}|\\s+|[0-9]|…|«|»|“|„")))
        .filter(str -> str.length() >= wordLength)
        // count occurrences, ignoring case
        .collect(Collectors.groupingBy(String::toLowerCase,
                Collectors.counting()))
        // here is where the rare words are filtered out.
        .entrySet().stream().filter(e -> e.getValue() >= wordRarity)
        // collect into a TreeMap so the keys come out sorted
        .collect(Collectors.toMap(Entry::getKey, Entry::getValue,
                (a, b) -> a, TreeMap::new));

Note that the (a,b)->a lambda is a merge function for resolving duplicate keys. It is never invoked here, because the keys of an entrySet are already unique. Unfortunately, toMap offers no overload that takes a map Supplier without also taking a merge function.

The easiest way to print them is as follows:

map.entrySet().forEach(e -> System.out.printf("%s - %s%n",
                e.getKey(), e.getValue()));
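
For anyone who wants to run the whole thing, here is a minimal self-contained sketch of the steps above. The class name WordFrequency and the helper countFrequentWords are made up for illustration, the split pattern is simplified to punctuation, whitespace and digits, and the sample input and thresholds (length 4, count 2) are assumptions taken from the question's small example rather than the real data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordFrequency {

    // Hypothetical helper: counts words of at least minLength characters
    // (case-insensitively) and keeps only those occurring at least minCount
    // times. The result is a TreeMap, so keys are in alphabetical order.
    static Map<String, Long> countFrequentWords(List<String> lines, int minLength, int minCount) {
        return lines.stream()
                // simplified split pattern (assumption): punctuation, whitespace, digits
                .flatMap(line -> Arrays.stream(line.split("\\p{Punct}|\\s+|[0-9]")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()))
                // rare words can only be dropped once the counts exist
                .entrySet().stream()
                .filter(e -> e.getValue() >= minCount)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The first line", "The second line",
                "Some words can repeat", "The first the second");
        countFrequentWords(lines, 4, 2)
                .forEach((word, count) -> System.out.printf("%s - %d%n", word, count));
    }
}
```

With that sample input, only "first", "line" and "second" reach a count of 2, so those are the three lines printed.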

Upvotes: 2

Most Noble Rabbit

Reputation: 2776

Given the stream you have already written:

List<String> lines = Arrays.asList(
        "For the rabbit, it was a bad day.",
        "An Antillean rabbit is very abundant.",
        "She put the rabbit back in the cage and closed the door securely, then ran away.",
        "The rabbit tired of her inquisition and hopped away a few steps.",
        "The Dean took the rabbit and went out of the house and away."
);

Map<String, Integer> wordCounts = Stream.of(lines)
        .flatMap(Collection::stream)
        .flatMap(str -> Arrays.stream(str.split("\\p{Punct}| |[0-9]|…|«|»|“|„")))
        .filter(str -> (str.length() >= 4))
        .collect(Collectors.toMap(
                String::toLowerCase,
                i -> 1,
                Integer::sum)
        );

System.out.println("Original:" + wordCounts);

Original output:

Original:{dean=1, took=1, door=1, very=1, went=1, away=3, antillean=1, abundant=1, tired=1, back=1, then=1, house=1, steps=1, hopped=1, inquisition=1, cage=1, securely=1, rabbit=5, closed=1}

You can do:

String results = wordCounts.entrySet()
        .stream()
        .filter(wordToCount -> wordToCount.getValue() > 2) // 2 is rare
        .sorted(Map.Entry.comparingByKey())
        .map(wordCount -> wordCount.getKey() + " - " + wordCount.getValue())
        .collect(Collectors.joining(", "));

System.out.println(results);

Filtered output:

away - 3, rabbit - 5
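
If you would rather not sort in a separate step, one possible variation is to count straight into a TreeMap with the three-argument Collectors.groupingBy, which accepts a map factory. The sketch below is only an illustration: the class WordReport and the method report are hypothetical names, the split pattern is simplified to punctuation, whitespace and digits, and it joins the entries with newlines so each word is on its own line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordReport {

    // Hypothetical helper: returns one "word - count" line for every word of
    // at least minLength characters that occurs at least minCount times.
    static String report(List<String> lines, int minLength, long minCount) {
        // The map factory argument makes groupingBy fill a TreeMap,
        // so the counts are already ordered by key.
        TreeMap<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\p{Punct}|\\s+|[0-9]")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new,
                        Collectors.counting()));

        // Drop the rare words and format what is left.
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= minCount)
                .map(e -> e.getKey() + " - " + e.getValue())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "For the rabbit, it was a bad day.",
                "An Antillean rabbit is very abundant.",
                "She put the rabbit back in the cage and closed the door securely, then ran away.",
                "The rabbit tired of her inquisition and hopped away a few steps.",
                "The Dean took the rabbit and went out of the house and away.");
        System.out.println(report(lines, 4, 3));
    }
}
```

Here a minimum count of 3 plays the same role as the "> 2" filter above, so the same two words survive, printed one per line.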

Upvotes: 2
