Tom Dufall

Reputation: 981

Aggregate logs by field value and plot as multiple series using AWS CloudWatch Insights

Analysing some log files using AWS CloudWatch Insights, I can plot a count aggregated in time bins with:

| stats count(*) by bin(1h)

This produces a graph, as expected, aggregating all logs in each time bin.

I want to split this data by a 'group' field, with values A and B.

| stats count(*) by group, bin(1h)

This returns log counts across time bins as expected, but the visualisation tab says 'No visualisation available.' I would like it to return a time series plot with a series for group A and a series for group B.

Where am I going wrong, or is this simply not possible?

Upvotes: 50

Views: 50833

Answers (6)

matemaciek

Reputation: 621

When faced with such a problem, I ended up writing a Python script to generate the query:

errors = [
    'Some error message',
    'Another error message',
    'And another'
]
# One boolean flag per error message: E0, E1, ...
fields = ', '.join(f"@message like '{e}' as E{i}" for i, e in enumerate(errors))
# Sum of all flags, used below to detect messages that match none of the patterns
other = '+'.join(f"E{i}" for i in range(len(errors)))
# One sum() per flag, named after the error message (spaces become underscores, punctuation is dropped)
stats = ', '.join(f"sum(E{i}) as {e.replace(' ', '_').replace('=', '').replace(':', '').replace(',', '')}" for i, e in enumerate(errors))
print(rf"""
fields @timestamp, @message
| filter level = 'error'
| fields {fields}, {other} == 0 as other
#| parse @message /(?<msg>[^\d\[]*).*/ | filter other == 1 | stats count() by msg
| stats {stats}, sum(other) as Other by bin(1d)
""")

This outputs the following query:

fields @timestamp, @message
| filter level = 'error'
| fields @message like 'Some error message' as E0, @message like 'Another error message' as E1, @message like 'And another' as E2, E0+E1+E2 == 0 as other
#| parse @message /(?<msg>[^\d\[]*).*/ | filter other == 1 | stats count() by msg
| stats sum(E0) as Some_error_message, sum(E1) as Another_error_message, sum(E2) as And_another, sum(other) as Other by bin(1d)

(The commented-out line is an alternative to the last line; it helps catch common error messages that can then be added to the list in the Python code.)
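
For reference, here is a sketch of that diagnostic variant, i.e. the generated query with the commented line enabled in place of the final per-bin stats (same placeholder messages as above):

fields @timestamp, @message
| filter level = 'error'
| fields @message like 'Some error message' as E0, @message like 'Another error message' as E1, @message like 'And another' as E2, E0+E1+E2 == 0 as other
| parse @message /(?<msg>[^\d\[]*).*/
| filter other == 1
| stats count() by msg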

Upvotes: 0

Nikhil Owalekar

Reputation: 708

Although it's still not possible to view multiple time series at once in CloudWatch Logs Insights, I was able to do this using Grafana.

My query:

fields @timestamp, @log, @logStream, @message
| filter ispresent(ContainerName) and ContainerName like /my-app-prod/ and Type = "Container"
| filter ispresent(TaskId)
| stats avg(CpuUtilized) by bin(1m), TaskId

In CloudWatch Logs Insights, the above query gives me minute-by-minute CPU utilization for each task, but if I click Visualization it gives me an error.

The same query in Grafana (> 7.x), however, shows the following result:

CPU Utilization by TaskId

Upvotes: 1

Esko Luontola

Reputation: 73625

Here is another variation, based on the answers of @smth et al. Not being able to build a chart automatically from the field values is very annoying, so you have to mention each value explicitly.

fields @timestamp, @message
| filter namespace like "audit-log"
| fields coalesce(`audit.client-version`, "legacy") as version
| parse version "legacy" as v0
| parse version "886ac066*" as v1
| parse version "93e021e2*" as v2
| stats count(*), count(v0), count(v1), count(v2) by bin(5m)

Result: CloudWatch Logs Insights visualization of the above query

Upvotes: 17

6ugr3

Reputation: 491

Expanding on the sample from @smth, I usually do it a little differently.

With this query I follow the trend of status codes, aggregated over time, in a standard nginx access log:

fields @timestamp, @message
| parse @message '* - * [*] "* * *" * * "-" "*"' as host, identity, dateTimeString, httpVerb, url, protocol, status, bytes, useragent
| stats count (*) as all, sum ( status < 299 ) as c_s200, sum ( status > 299 and status < 399 ) as c_s300, sum ( status > 399 and status < 499 ) as c_s400, sum ( status > 499 ) as c_s500 by bin (1m)

The trick here is that an expression like "status > 499" returns 0 if false and 1 if true, so summing it per time bucket simulates something like a 'count if [condition]' (see the sketch at the end of this answer).

This is what the resulting graph looks like on the visualisation tab:

count of each status code aggregated into 1-minute buckets
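
Applied back to the original question, the same boolean-sum trick would look roughly like the sketch below (assuming the field is literally named group and holds the values "A" and "B"):

fields @timestamp, @message
| fields (group like "A") as is_a, (group like "B") as is_b
| stats sum(is_a) as group_A, sum(is_b) as group_B by bin(1h)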

Upvotes: 34

Jared Stewart

Reputation: 616

There is also an extension to the workaround described by @smth that can support more complicated statistics than count(). Here's an example that graphs the CPU usage across different instances over time:

| fields (instance_id like "i-instance_1") as is_instance_1, (instance_id like "i-instance_2") as is_instance_2
| stats sum(cpu * is_instance_1) as cpu_1, sum(cpu * is_instance_2) as cpu_2 by bin(5m)

Upvotes: 11

smth

Reputation: 662

Alright, I found a very janky way to solve this. It appears that no, you cannot do

| stats count(*) by group, bin(1h)

However, you can use parse to artificially create new variables, like so:

parse "[E*]" as @error
| parse "[I*]" as @info
| parse "[W*]" as @warning
| filter ispresent(@warning) or ispresent(@error) or ispresent(@info)
| stats count(@error) as error, count(@info) as info, count(@warning) as warning by bin(15m)

Here, I'm trying to look at the distribution of log types over time. I have three types of log messages, with formats "[ERROR]", "[INFO]", and "[WARNING]".

Upvotes: 45
