Cumulate word count by Flink timeWindow using Scala

Question

I have the twitter streaming API and I am retrieving tweets from there.
I also have a list of desired words that I want to take into account.

What I want to do is to store to my Cassandra dataBase always the most accurate value corresponding to how many times the word was used on the day.

I was thinking of using window functions to consolidade the results each 5 seconds and then writing this consolidate value on the database.

I don't know if this is the best approach. If this is the best approach, I tried to do a simple example following the documentation, but it doesn't group the words each 5 seconds.


    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val counts =
      env.fromElements("foo bar test test baz foo", "yes no no yes", "hi hello hi hello")
      .flatMap { _.toLowerCase.split("\W+") filter { _.nonEmpty } }
      .filter(word => Words.listOfWords.contains(word) || Words.listOfWords2.contains(word))
      .map { (_, 1) }
      .keyBy(0)
      .timeWindow(Time.seconds(5)).sum( 1)

    counts.print()
    env.execute("test-code")

  }

Cumulate word count by Flink timeWindow using Scala

Answers (1)

Related Questions