Reputation: 307
I am trying out Apache Flink, and to test my knowledge from the learning, I am playing with the classic Word Count problem.
Here's my code:
public class TestWordCount {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> addSource = env.addSource(new TestSource());
DataStream<Tuple2<String, Integer>> sum = addSource
.flatMap(new Tokenizer())
.keyBy(0)
.sum(1);
sum.print();
env.execute();
}
}
class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
for(String part: value.split(" "))
out.collect(new Tuple2<>(part.toLowerCase(), 1));
}
}
class TestSource implements SourceFunction<String> {
private static final long serialVersionUID = 1L;
String s = "Hadoop is the Elephant King! A yellow and elegant thing. He never forgets. The Useful data, or lets An extraneous element cling!";
@Override
public void run(SourceContext<String> ctx) throws Exception {
ctx.collect(s);
}
@Override
public void cancel() {
}
}
When I am running it, the output is like this:
(hadoop,1) (is,1) (the,1) (elephant,1) (king!,1) (a,1) (yellow,1) (and,1) (elegant,1) (thing.,1) (he,1) (never,1) (forgets.,1) (the,2) (useful,1) (data,,1) (or,1) (lets,1) (an,1) (extraneous,1) (element,1) (cling!,1)
I am just curious to know, why the
is coming twice, as (the,1)
and (the,2)
?
Help would be much appreciated.
Upvotes: 0
Views: 226
Reputation: 43717
When working with data streams, the input is unbounded, and so it's not possible to wait until "the end" to print out the results. The concept of a "final report" is meaningless. So what you get instead is a continuously updating stream of the results so far.
Upvotes: 1
Reputation: 1009
Why the is comming twice?
I believe that you've sent the "the" twice. And the (the, 1) is the count when you sent the first "the", the (the, 2) is the count when you sent the second "the".
The sum will aggregate the data every time it receive an element and output it.
Upvotes: 0