Reputation: 902
Context: I'm not necessarily referring to a KCL-based application, just pure Kinesis API calls.
Does the using the TRIM_HORIZON
shard iterator type immediately give you the earliest published record in the stream (ie earliest available within Kinesis' built-in 24hr window), or simply an iterator/cursor for some time period as much as 24 hours ago, that you must then use to advance along the stream until you hit the earliest published record?
Put another way, in case that's not quite clear....
When using the shard iterator type of TRIM_HORIZON
, is the expected behavior that it will begin with returning the records that were available 24 hours ago, BUT if zero records were published exactly 24 hours ago, and instead only 3 hours ago, that your application will need to iteratively poll through the previous 21 hours before it reaches the records published 3 hours ago?
Timeline example:
GetShardIterator
call with TRIM_HORIZON
as your shard iterator type, then issue a GetRecords
call with that shard iterator and receive the record "Item=A"GetShardIterator
call with TRIM_HORIZON
as your shard iterator type, then issue a GetRecords
call with that shard iterator. What should be expected as the result from this call? (Note: we did not remember/re-use the shard iterator from step 3)For Step 5 above, it's been more than 24 hours since the "Item=A" message was published on the stream and only a minute since "Item=B" was published. Will a fresh shard iterator with TRIM_HORIZON
immediately give you the earliest available record, or do you need to need to keep iterating until you hit a time period when something has been published?
I'd been experimenting with Kinesis and everything was working fine yesterday or two days ago (ie. I was publishing AND consuming without any issues). I made some additional modifications to my code and began publishing again today. When I fired up my consumer, nothing was coming out at all even after letting it run for a few minutes. I tried publishing and consuming at exactly the same time, and still nothing. After manually playing with the AFTER_SEQUENCE_NUMBER
iterator type, and using some sequence numbers from my consumer logs from a few days ago, I was able to reach my recently published messages. But then if I go back to using the TRIM_HORIZON
type, I see no messages at all.
I've looked at the docs, but most of docs I found assume you are using the KCL (I actually was using KCL initially, but when it started failing I dropped down to raw API calls) and mention that you must have an application name and that DynamoDB tables are used for tracking state. Which as best I can tell is not true if you're using pure Kinesis API calls or the Kinesis CLI, both of which I eventually tried. I finally wrote a pure API script to start with TRIM_HORIZON
and poll infinitely and eventually it hit new records (took ~600 iterations; started out 14hrs behind "now" and found records at about 5 hours behind "now"). If this is expected behavior, it seems like the wording in the docs is just a little confusing/misleading:
TRIM_HORIZON - Start reading at the last untrimmed record in the shard in the system, which is the oldest data record in the shard.
I assumed (now seemingly incorrectly) that the terms "oldest data record" meant record that I've published into the stream, not simply a time period in the stream.
It'd be great if someone can help confirm/explain the behavior I'm seeing.
Thanks!
Upvotes: 14
Views: 7143
Reputation: 975
TRIM_HORIZON gives the oldest record in the stream.
Just that sometimes on giving TRIM_HORIZON as the shard_iterator_type :-
Suppose the value of "millis_behind_latest" in the kinesis response is ~86399000 & your stream retention period is 24 hours(86400000)
By the time you use the shard_iterator to retrieve the record, the record is no longer in the stream as the retention period of the record has been exceeded. Hence you get an empty result because the oldest record has expired and no longer there in the data stream. So the shard_iterator is now pointing to an empty space in the disk.
When such a thing happens take the value of "next_shard_iterator" and use get_records to once again get the kinesis data records.
Also another thing is we do not completely know how AWS manages each shard in the data stream. How data is erased and added into it. Maybe data is not stored in concurrent/contiguous memory memory blocks and hence we get empty results in between retrieval of data.
Keep taking the value of "next_shard_iterator" and use get_records until you get a value of 0 for "millis_behind_latest".
Hope this answer helps. :)
Upvotes: 1
Reputation: 10566
it's at the TRIM HORIZON, or the HORIZON where the stream TRIMming happens.
the shard iterator may get 0 records when called, so you'll need to keep iterating to reach the area where the oldest record is (if you push infrequently to the stream or have time gaps). the getRecords will give you the next shard iterator you can use to iterate.
from doc: http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html
If there are no records available in the portion of the shard that the iterator points to, GetRecords returns an empty list. Note that it might take multiple calls to get to a portion of the shard that contains records.
Upvotes: 1