magnanimousllamacopter
magnanimousllamacopter

Reputation: 406

AWS Cloudwatch Log Insights: Aggregate results are impossible (count - count_distinct is negative)

I'm running a CloudWatch log insights query on a single log stream that corresponds to a single Python AWS Lambda function. This function logs a unique line corresponding to the key in s3 that it is processing. It logs this line once at the beginning of the invocation. The only condition where it won't log this line is if it fails before it even reads the event.

The query is:

parse @message /(?<@unique_key>Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+)/
| filter @message like /Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+/
| stats count(@unique_key) - count_distinct(@unique_key) as @distinct_unique_keys_delta
        by datefloor(@timestamp, 1d) as @_datefloor 
| sort @_datefloor asc

The two regular expressions in this query will parse the full key of the s3 file being processed. In this particular problem and in general, my understanding is that the count(...) of any quantity minus the count_distinct(...) of the same quantity should always be greater than or equal to zero.

For several of the days in the results, it is a negative number.

I thought I might be misunderstanding the correct usage of datefloor(), so I tried running the following query:

parse @message /(?<@unique_key>Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+)/
| filter @message like /Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+/
| stats count(@unique_key) - count_distinct(@unique_key) as @distinct_unique_keys_delta

The result was -20,347.

At this point the only scenarios I can see are

  1. Something wrong with the code executing the query.
  2. I'm misunderstanding this tool.

Upvotes: 0

Views: 1800

Answers (1)

magnanimousllamacopter
magnanimousllamacopter

Reputation: 406

I have discovered that the count_distinct function in AWS Log Insights queries doesn't really return a distinct count! As per the documentation

Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.

Apparently I can't just assume that a function returns an accurate result.

The documentation page.

Upvotes: 1

Related Questions