xtx

Reputation: 4446

What are the options to process timeseries data from a Kinesis stream

I need to process data from an AWS Kinesis stream, which collects events from devices. The processing function has to be called every second with all events received during the last 10 seconds.


Say I have two devices, A and B, that write events into the stream. My procedure is named MyFunction and takes the following params:

If I start processing at 10:00:00 (and have already accumulated events for devices A and B for the last 10 seconds), then I need to make two calls:

In the next second, at 10:00:01

and so on.


It looks like the simplest way to accumulate all the data received from the devices is to just store it in memory in a temporary buffer (the last 10 seconds only, of course), so I'd like to try this first.

And the most convenient way I have found to keep such a memory-based buffer is to create an application based on the Java Kinesis Client Library (KCL).
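To make the buffering idea concrete, here is a minimal sketch of such an in-memory window, assuming events arrive through the KCL processRecords() callback and that TimestampedEvent, add() and snapshot() are hypothetical names (not part of the KCL API); a scheduler would call snapshot() once per second and pass the result to MyFunction:

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

public class SlidingBuffer {
    private static final long WINDOW_MS = 10_000;   // keep only the last 10 seconds
    private final Map<String, Deque<TimestampedEvent>> byDevice = new ConcurrentHashMap<>();

    static final class TimestampedEvent {
        final long arrivalMs;
        final byte[] payload;
        TimestampedEvent(long arrivalMs, byte[] payload) {
            this.arrivalMs = arrivalMs;
            this.payload = payload;
        }
    }

    // Called from the KCL processRecords() callback for every decoded record.
    public void add(String deviceId, byte[] payload) {
        byDevice.computeIfAbsent(deviceId, k -> new ConcurrentLinkedDeque<>())
                .addLast(new TimestampedEvent(System.currentTimeMillis(), payload));
    }

    // Called once per second: drops events older than 10 seconds and
    // returns the current window for one device.
    public List<TimestampedEvent> snapshot(String deviceId) {
        long cutoff = System.currentTimeMillis() - WINDOW_MS;
        Deque<TimestampedEvent> q = byDevice.getOrDefault(deviceId, new ConcurrentLinkedDeque<>());
        while (!q.isEmpty() && q.peekFirst().arrivalMs < cutoff) {
            q.pollFirst();                           // evict expired events
        }
        return new ArrayList<>(q);
    }
}
```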

I have also considered an AWS Lambda based solution, but it looks like it's impossible to keep data in memory between Lambda invocations. Another option for Lambda is to have two functions: the first writes all the data into DynamoDB, and the second is called every second to process data fetched from the DB rather than from memory. (So this option is much more complicated.)

So my question is: what other options are there to implement such processing?

Upvotes: 0

Views: 410

Answers (1)

az3

Reputation: 3629

What you are doing is called a "window operation" (or "windowed computation"). There are multiple ways to achieve that; as you said, buffering is the best option.

  • In memory cache systems: Ehcache, Hazelcast

Accumulate data in a cache system and choose the proper eviction policy (10 seconds in your case). Then do a grouping and summation operation and calculate the output; see the Ehcache sketch after this list.

  • In memory database: Redis, VoltDB

Just like with a cache system, you can use an in-memory database. Redis could be helpful here and keeps the state for you. If you use VoltDB or a similar SQL system, calling a sum() or avg() operation would be easier. A Redis sketch is shown after this list as well.
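For the cache option, here is a minimal sketch assuming Ehcache 3.x, with events stored as strings under a hypothetical deviceId#sequenceNumber key. The 10-second time-to-live makes the cache drop old entries, so iterating it yields roughly the current window:

```java
import java.time.Duration;
import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ExpiryPolicyBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

public class EventWindowCache {
    public static void main(String[] args) {
        // Cache whose entries expire 10 seconds after they are written.
        CacheManager manager = CacheManagerBuilder.newCacheManagerBuilder()
                .withCache("events", CacheConfigurationBuilder
                        .newCacheConfigurationBuilder(String.class, String.class,
                                ResourcePoolsBuilder.heap(100_000))
                        .withExpiry(ExpiryPolicyBuilder.timeToLiveExpiration(Duration.ofSeconds(10))))
                .build(true);

        Cache<String, String> events = manager.getCache("events", String.class, String.class);

        // key: "<deviceId>#<sequenceNumber>" (hypothetical), value: serialized event
        events.put("A#seq-0001", "{\"metric\":42}");

        // Each second: walk the non-expired entries and aggregate per device.
        for (Cache.Entry<String, String> entry : events) {
            String deviceId = entry.getKey().split("#")[0];
            System.out.println(deviceId + " -> " + entry.getValue());
            // ... group by deviceId and call MyFunction(deviceId, collectedEvents)
        }

        manager.close();
    }
}
```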
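For the Redis option, here is a sketch using the Jedis client and a sorted set scored by arrival time; the key layout (events:A) is only an assumption, and since sorted-set members must be unique, the payload should carry an event id in practice:

```java
import redis.clients.jedis.Jedis;

public class RedisWindow {
    public static void main(String[] args) {
        // Assumes a local Redis instance and the Jedis client.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long now = System.currentTimeMillis();

            // Store each event in a per-device sorted set, scored by arrival time.
            jedis.zadd("events:A", now, "{\"id\":\"e-1\",\"metric\":42}");

            // Trim everything older than the 10-second window.
            jedis.zremrangeByScore("events:A", 0, now - 10_000);

            // Fetch the current window for device A and hand it to the processing step.
            for (String event : jedis.zrangeByScore("events:A", now - 10_000, now)) {
                System.out.println(event);   // e.g. parse, aggregate, then call MyFunction
            }
        }
    }
}
```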

It is also possible to use Spark Streaming to do that windowed counting. You can run it on Elastic MapReduce (EMR), so you stay in the AWS ecosystem and integration is easier.
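If you go the Spark route, Spark Streaming has built-in windowing. A rough sketch follows; the socket source is just a stand-in for the Kinesis receiver from spark-streaming-kinesis-asl, and the "deviceId,payload" record format is an assumption:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class KinesisWindowJob {
    public static void main(String[] args) throws InterruptedException {
        // local[2] is just for trying it out; on EMR this would be submitted to the cluster.
        SparkConf conf = new SparkConf().setAppName("kinesis-window").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Placeholder source: in a real job this would be the Kinesis receiver.
        JavaDStream<String> events = ssc.socketTextStream("localhost", 9999);

        // 10-second window re-evaluated every second, which matches the requirement.
        JavaPairDStream<String, Integer> countsPerDevice = events
                .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1)) // assume "deviceId,payload"
                .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(10), Durations.seconds(1));

        countsPerDevice.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```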

Upvotes: 1
