Reputation: 5720
I have a stream of streams of words (this format is not set by me and cannot be changed). For example:
Stream<String> doc1 = Stream.of("how", "are", "you", "doing", "doing", "doing");
Stream<String> doc2 = Stream.of("what", "what", "you", "upto");
Stream<String> doc3 = Stream.of("how", "are", "what", "how");
Stream<Stream<String>> docs = Stream.of(doc1, doc2, doc3);
I'm trying to get this into a Map<String, Multiset<Integer>> (or its corresponding stream, as I want to process it further), where the String key is the word itself and the Multiset<Integer> holds the number of times that word appears in each document (zeros should be excluded). Multiset is a Google Guava class (not from java.util).
For example:
how -> {1, 2} // because it appears once in doc1, twice in doc3 and not at all in doc2 (so doc2's count should not be included)
are -> {1, 1} // once in doc1 and once in doc3
you -> {1, 1} // once in doc1 and once in doc2
doing -> {3} // thrice in doc1, none in others
what -> {2,1} // so on
upto -> {1}
What is a good way to do this in Java 8?
I tried using flatMap, but the inner Stream greatly limits the options I have.
Upvotes: 11
Views: 1007
Reputation: 1254
Here is a simple solution using AbacusUtil:
Map<String, List<Integer>> m = Stream.of(doc1, doc2, doc3)
.flatMap(d -> d.toMultiset().stream()).collect(Collectors.toMap2());
Upvotes: 1
Reputation: 34460
Since you are using Guava, you could take advantage of its stream utilities, and of its Table structure as well. Here's the code:
Table<String, Long, Long> result =
    Streams.mapWithIndex(docs, (doc, i) -> doc.map(word -> new SimpleEntry<>(word, i)))
        .flatMap(Function.identity())
        .collect(Tables.toTable(
            Entry::getKey, Entry::getValue, p -> 1L, Long::sum, HashBasedTable::create));
Here I'm using the Streams.mapWithIndex method to assign an index to each inner stream. Within the map function, I'm transforming each word into a pair that consists of the word and the index, so that I can later tell which document the word belongs to.
Then, I'm flat-mapping the (word, index) pairs of all documents into one stream, and finally, I'm collecting all the pairs into a Guava Table by means of the Tables.toTable collector. The row is the word, the column is the document (represented by its index) and the value is the number of times the word occurs in that document (I'm assigning 1L to each (word, index) pair and using Long::sum to merge collisions).
You have all the info you need in the result table, but if you still need a Map<String, Multiset<Integer>> (here with Long counts), you could do it this way:
Map<String, Multiset<Long>> map = Maps.transformValues(
    result.rowMap(),
    m -> HashMultiset.create(m.values()));
Note: you need Guava 21 for this to work.
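For readers who cannot use Guava 21, the same idea of tagging each word with its document index can be sketched with plain JDK collections. This is only an illustration under assumptions: the class name WordDocumentCounts, the method countByDocument, and the nested TreeMap standing in for Guava's Table are all hypothetical and not part of any library.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper; the nested map is an assumed stand-in for Guava's Table.
final class WordDocumentCounts {

    // Returns word -> (document index -> occurrences of the word in that document).
    static Map<String, Map<Integer, Long>> countByDocument(Stream<Stream<String>> docs) {
        Map<String, Map<Integer, Long>> table = new TreeMap<>();
        int index = 0;
        // Walk the outer stream sequentially so each inner stream gets a stable index.
        for (Iterator<Stream<String>> it = docs.iterator(); it.hasNext(); index++) {
            final int docIndex = index;
            // Add 1 for every (word, docIndex) occurrence, summing on collisions.
            it.next().forEach(word ->
                table.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(docIndex, 1L, Long::sum));
        }
        return table;
    }

    public static void main(String[] args) {
        Stream<Stream<String>> docs = Stream.of(
            Stream.of("how", "are", "you", "doing", "doing", "doing"),
            Stream.of("what", "what", "you", "upto"),
            Stream.of("how", "are", "what", "how"));
        // prints {are={0=1, 2=1}, doing={0=3}, how={0=1, 2=2}, upto={1=1}, what={1=2, 2=1}, you={0=1, 1=1}}
        System.out.println(countByDocument(docs));
    }
}
```

Documents in which a word never occurs simply have no entry in that word's inner map, so zeros are excluded without extra filtering.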
Upvotes: 3
Reputation: 3453
Map<String, Multiset<Integer>> result = docs
    .map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
    .flatMap(m -> m.entrySet().stream())
    .collect(Collectors.groupingBy(Multiset.Entry::getElement,
        Collectors.mapping(Multiset.Entry::getCount,
            Collectors.toCollection(HashMultiset::create))));
// {upto=[1], how=[1, 2], doing=[3], what=[1, 2], are=[1 x 2], you=[1 x 2]}
Multiset is useful for getting the word count, but not really necessary for storing the counts. If you're fine with Map<String, List<Integer>>, just replace the last line with Collectors.toList())));
Or, since you're using Guava anyway, why not a ListMultimap?
ListMultimap<String, Integer> result = docs
    .map(s -> s.collect(Collectors.toCollection(HashMultiset::create)))
    .flatMap(m -> m.entrySet().stream())
    .collect(ArrayListMultimap::create,
        (r, e) -> r.put(e.getElement(), e.getCount()),
        Multimap::putAll);
// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}
Upvotes: 3
Reputation: 120968
Map<String, List<Long>> map = docs.flatMap(
        inner -> inner.collect(
                Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet()
            .stream())
    .collect(Collectors.groupingBy(
        Entry::getKey,
        Collectors.mapping(Entry::getValue, Collectors.toList())));
System.out.println(map);
// {upto=[1], how=[1, 2], doing=[3], what=[2, 1], are=[1, 1], you=[1, 1]}
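The pipeline above yields Long counts because of Collectors.counting(). If Integer counts are wanted, as in the question, one option is to swap in Collectors.summingInt. The following is a minimal sketch under assumptions: the class and method names are hypothetical, and TreeMap is used only so that the printed key order is deterministic.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical wrapper class, named for illustration only.
final class PerDocumentWordCounts {

    static Map<String, List<Integer>> countsPerWord(Stream<Stream<String>> docs) {
        return docs
            // Count each word within a single document; summingInt gives Integer, not Long.
            .flatMap(doc -> doc
                .collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(w -> 1)))
                .entrySet()
                .stream())
            // Regroup by word; each list holds the per-document counts in document order.
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // Streams are single-use, so the documents are built fresh here.
        Stream<Stream<String>> docs = Stream.of(
            Stream.of("how", "are", "you", "doing", "doing", "doing"),
            Stream.of("what", "what", "you", "upto"),
            Stream.of("how", "are", "what", "how"));
        // prints {are=[1, 1], doing=[3], how=[1, 2], upto=[1], what=[2, 1], you=[1, 1]}
        System.out.println(countsPerWord(docs));
    }
}
```

Each document contributes at most one entry per word, so a document in which the word is absent adds nothing to that word's list.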
Upvotes: 10