Reputation: 600
I am trying to build a data collection pipeline on top of AWS services. The overall architecture is shown below.
In summary, the system should receive events from API Gateway (1), one request per event, and write the data to Kinesis (2).
I am expecting ~100k events per second. My question is about using the KPL in Lambda functions. In step 2, I am planning to write a Lambda function that uses the KPL to write events to Kinesis with high throughput. But I am not sure this is possible, since API Gateway invokes the Lambda function separately for each event.
Is it possible/reasonable to use the KPL in such an architecture, or should I use the Kinesis Put API instead?
        1                              2                              3                             4
+----------------+             +----------------+             +----------------+            +----------------+
|                |             |                |             |                |            |                |
|                |             |                |             |                |            |                |
|   AWS API GW   +-----------> |   AWS Lambda   +-----------> |  AWS Kinesis   +----------> |   AWS Lambda   |
|                |             | Function with  |             |    Streams     |            |                |
|                |             |      KPL       |             |                |            |                |
|                |             |                |             |                |            |                |
+----------------+             +----------------+             +----------------+            +-----+-----+----+
                                                                                                  |     |
                                                                                                  |     |
                                                                                                  |     |
                                                                                                  |     |
                                                                                                  |     |
                                                                                 5                |     |              6
                                                                         +----------------+       |     |      +----------------+
                                                                         |                |       |     |      |                |
                                                                         |                |       |     |      |                |
                                                                         |     AWS S3     <-------+     +----> |  AWS Redshift  |
                                                                         |                |                    |                |
                                                                         |                |                    |                |
                                                                         |                |                    |                |
                                                                         +----------------+                    +----------------+
I am also thinking about writing the data to S3 first, instead of calling the KPL Lambda function directly from API Gateway. If the first architecture is not reasonable, this may be a solution, but in that case there will be a delay before the data is written to Kinesis.
        1                              2                         3                              4                             5
+----------------+             +----------------+        +----------------+             +----------------+            +----------------+
|                |             |                |        |                |             |                |            |                |
|                |             |                |        |                |             |                |            |                |
|   AWS API GW   +-----------> |   AWS Lambda   +------> |   AWS Lambda   +-----------> |  AWS Kinesis   +----------> |   AWS Lambda   |
|                |             | to write data  |        | Function with  |             |    Streams     |            |                |
|                |             |     to S3      |        |      KPL       |             |                |            |                |
|                |             |                |        |                |             |                |            |                |
+----------------+             +----------------+        +----------------+             +----------------+            +-----+-----+----+
                                                                                                                            |     |
                                                                                                                            |     |
                                                                                                                            |     |
                                                                                                                            |     |
                                                                                                                            |     |
                                                                                                           6                |     |              7
                                                                                                   +----------------+       |     |      +----------------+
                                                                                                   |                |       |     |      |                |
                                                                                                   |                |       |     |      |                |
Upvotes: 0
Views: 1764
Reputation: 391
If each request coming through AWS API Gateway corresponds to a single Kinesis Data Streams record, it makes no sense to use the KPL, as Jens pointed out. In this case you can call the Kinesis API directly from API Gateway without using Lambda. Alternatively, if you need some additional processing, you can do it in Lambda and send the data through PutRecord (not PutRecords, which the KPL uses). Your code in Java would look like this:
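import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

// Build the client once and reuse it for every record.
AmazonKinesisClientBuilder clientBuilder = AmazonKinesisClientBuilder.standard();
clientBuilder.setRegion(REGION);
clientBuilder.setCredentials(new DefaultAWSCredentialsProviderChain());
clientBuilder.setClientConfiguration(new ClientConfiguration());
AmazonKinesis kinesisClient = clientBuilder.build();
...
// Then, for each record:
PutRecordRequest putRecordRequest = new PutRecordRequest();
putRecordRequest.setStreamName(STREAM_NAME);
putRecordRequest.setData(data); // payload as a java.nio.ByteBuffer
putRecordRequest.setPartitionKey(daasEvent.getAnonymizedId());
// An explicit hash key overrides the hash of the partition key when the shard is chosen.
putRecordRequest.setExplicitHashKey(Utils.randomExplicitHashKey());
// Optional: pass the previous record's sequence number to preserve ordering.
putRecordRequest.setSequenceNumberForOrdering(sequenceNumberOfPreviousRecord);
PutRecordResult putRecordResult = kinesisClient.putRecord(putRecordRequest);
sequenceNumberOfPreviousRecord = putRecordResult.getSequenceNumber();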
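The `Utils.randomExplicitHashKey()` helper used above is not shown. A minimal sketch of what such a helper might look like, assuming it only needs to return a uniformly random 128-bit unsigned integer as a decimal string (the format Kinesis accepts for an explicit hash key); the class name `ExplicitHashKeys` is hypothetical:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Hypothetical stand-in for the Utils class referenced in the answer.
final class ExplicitHashKeys {
    private static final SecureRandom RANDOM = new SecureRandom();

    private ExplicitHashKeys() {}

    // Returns a random integer in [0, 2^128 - 1] rendered as a decimal string.
    static String randomExplicitHashKey() {
        return new BigInteger(128, RANDOM).toString(10);
    }
}
```

A random explicit hash key spreads records evenly across shards, but it also means that records sharing a partition key are no longer guaranteed to land on the same shard.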
However, there are cases where using the KPL from Lambda makes sense, for example when the data sent to AWS API Gateway contains multiple individual records that will be sent to one or more streams. In those cases the benefits of the KPL (see https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-concepts.html) still apply, but you have to be aware of the specifics of running it in Lambda, in particular the "issue" described at https://github.com/awslabs/amazon-kinesis-producer/issues/143, and call
kinesisProducer.flushSync()
at the end of the insertions, which also worked for me.
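To illustrate why the flush matters, here is a simplified, library-free sketch of the pattern: records are only queued in memory until a synchronous flush sends them, so a handler that returns without flushing may leave records unsent when the Lambda environment is frozen. `BufferingProducerSketch` and `handleRequest` are hypothetical stand-ins and are not the KPL API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a buffering producer; NOT the real KPL classes.
final class BufferingProducerSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> sentBatches = new ArrayList<>();

    // Queues a record locally; nothing is sent at this point.
    void addUserRecord(String data) {
        buffer.add(data);
    }

    // Sends everything buffered so far as one batch and returns when done.
    void flushSync() {
        if (!buffer.isEmpty()) {
            sentBatches.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    int pendingCount() {
        return buffer.size();
    }

    List<List<String>> sentBatches() {
        return sentBatches;
    }

    // Hypothetical handler body: one request carrying several events.
    // Returns the number of records still unsent when the handler exits.
    static int handleRequest(BufferingProducerSketch producer, List<String> events) {
        for (String event : events) {
            producer.addUserRecord(event);
        }
        producer.flushSync(); // without this call, the events would stay in the buffer
        return producer.pendingCount();
    }
}
```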
Upvotes: 1
Reputation: 14029
I do not think using the KPL is the right choice here. The key concept of the KPL is that records are collected on the client and then sent to Kinesis as a batch operation. Since Lambdas are stateless per invocation, it would be rather difficult to store the records for aggregation (before sending them to Kinesis).
I think you should have a look at the following AWS article, which explains how you can connect API Gateway directly to Kinesis. This way, you can avoid the extra Lambda that just forwards your request.
Create an API Gateway API as a Kinesis Proxy
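When every event is written as its own record, the stream has to be sized for the raw event rate. A rough estimate, assuming the per-shard write limits for provisioned streams (1,000 records per second and about 1 MB per second) and an assumed average event size of 500 bytes, which is not stated in the question:

```java
final class ShardEstimate {
    // Per-shard write limits for provisioned Kinesis Data Streams.
    static final long MAX_RECORDS_PER_SHARD_PER_SEC = 1_000;
    static final long MAX_BYTES_PER_SHARD_PER_SEC = 1_048_576; // roughly 1 MB

    private ShardEstimate() {}

    // Minimum shard count that satisfies both the record-rate and the byte-rate limit.
    static long minShards(long eventsPerSecond, long bytesPerEvent) {
        long byRecords = ceilDiv(eventsPerSecond, MAX_RECORDS_PER_SHARD_PER_SEC);
        long byBytes = ceilDiv(eventsPerSecond * bytesPerEvent, MAX_BYTES_PER_SHARD_PER_SEC);
        return Math.max(byRecords, byBytes);
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        // ~100k events/s from the question; 500 bytes per event is an assumption.
        System.out.println(minShards(100_000, 500)); // prints 100
    }
}
```

With small events the record-rate limit dominates, so about 100 shards is the floor for 100k unaggregated events per second, before leaving any headroom for uneven key distribution.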
Upvotes: 2