PankajK
PankajK

Reputation: 307

Duplicate keys are present in output

I am trying out Apache Flink, and to test my knowledge from the learning, I am playing with the classic Word Count problem.

Here's my code:

public class TestWordCount {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        DataStreamSource<String> addSource = env.addSource(new TestSource());

        DataStream<Tuple2<String, Integer>> sum = addSource
        .flatMap(new Tokenizer())
        .keyBy(0)
        .sum(1);

        sum.print();
        env.execute();
    }

}

class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

    private static final long serialVersionUID = 1L;

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
        for(String part: value.split(" "))
            out.collect(new Tuple2<>(part.toLowerCase(), 1));
    }
}

class TestSource implements SourceFunction<String> {

    private static final long serialVersionUID = 1L;
    String s = "Hadoop is the Elephant King! A yellow and elegant thing. He never forgets. The Useful data, or lets An extraneous element cling!";

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        ctx.collect(s);
    }

    @Override
    public void cancel() {
    }
}

When I am running it, the output is like this:

(hadoop,1) (is,1) (the,1) (elephant,1) (king!,1) (a,1) (yellow,1) (and,1) (elegant,1) (thing.,1) (he,1) (never,1) (forgets.,1) (the,2) (useful,1) (data,,1) (or,1) (lets,1) (an,1) (extraneous,1) (element,1) (cling!,1)

I am just curious to know, why the is coming twice, as (the,1) and (the,2)?

Help would be much appreciated.

Upvotes: 0

Views: 226

Answers (2)

David Anderson
David Anderson

Reputation: 43717

When working with data streams, the input is unbounded, and so it's not possible to wait until "the end" to print out the results. The concept of a "final report" is meaningless. So what you get instead is a continuously updating stream of the results so far.

Upvotes: 1

Jiayi Liao
Jiayi Liao

Reputation: 1009

Why the is comming twice?

I believe that you've sent the "the" twice. And the (the, 1) is the count when you sent the first "the", the (the, 2) is the count when you sent the second "the".

The sum will aggregate the data every time it receive an element and output it.

Upvotes: 0

Related Questions