Catarina Nogueira
Catarina Nogueira

Reputation: 1138

Cumulate word count by Flink timeWindow using Scala

I have the twitter streaming API and I am retrieving tweets from there.
I also have a list of desired words that I want to take into account.

What I want to do is to store to my Cassandra dataBase always the most accurate value corresponding to how many times the word was used on the day.

I was thinking of using window functions to consolidade the results each 5 seconds and then writing this consolidate value on the database.

I don't know if this is the best approach. If this is the best approach, I tried to do a simple example following the documentation, but it doesn't group the words each 5 seconds.


    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val counts =
      env.fromElements("foo bar test test baz foo", "yes no no yes", "hi hello hi hello")
      .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .filter(word => Words.listOfWords.contains(word) || Words.listOfWords2.contains(word))
      .map { (_, 1) }
      .keyBy(0)
      .timeWindow(Time.seconds(5)).sum( 1)

    counts.print()
    env.execute("test-code")

  }

Upvotes: 0

Views: 146

Answers (1)

Dominik Wosiński
Dominik Wosiński

Reputation: 3864

Well, currently it will not work, because You are creating the DataStream from elements, which is not the best idea for windowing, because You won't really have a 5 seconds of the runtime to create more than one window, so all of the messages will go to the same window. But, if you would run this on the actual Twitter API, this should generally group the items into windows properly.

Upvotes: 1

Related Questions