time4tea
time4tea

Reputation: 2197

Searching Athena AWS WAF logs by http header

I've set up AWS WAF Logging to S3, and created an Athena table as described in the documentation ( https://docs.aws.amazon.com/athena/latest/ug/waf-logs.html )

However, http headers are stored as an array<struct<name: string, value: string>> not a map (for valid reasons). I'd like to write a query like

select headers.user-agent, headers.if-none-match from waf_logs where something;

if-none-match may or may not be present in the headers list.

using CROSS JOIN UNNEST(httprequest.headers) doesn't work, as this will create multiple rows. Using map notation doesn't work, as its a array of struct, not a map.

There are a lot of pages on the internet about how to set up the table, but not so many with worked examples of real-life queries, and none that I can find about how to query by nested attributes.

I'd really appreciate any suggestions. Thanks!

Upvotes: 1

Views: 4704

Answers (3)

tom10271
tom10271

Reputation: 4649

I strongly recommend all developers not to use UNNEST if possible, you don't want cartesian products most of the time and you should definitely want to avoid unnesting multiple times as the number of rows will be unimaginably high. Just cast httprequest.headers then use filter to extract value.

It is also tedious and unstable to unnest then group again just to get the value and if you need to extract multiple values from httprequest.headers by key, you simply cannot get the 100% correct result easily and stably.

Remember, most developers think they know SQL well and mastered it, only a few realize how hard to get it right and how easy SQL returns an illusional and convincible result making you think you wrote the correct SQL and got correct result.

WITH
    targets AS (
        SELECT
            *,
            cast(
                httprequest.headers as ARRAY(ROW(name VARCHAR, value VARCHAR))
            ) as headers
        FROM your_app.waf_logs
        WHERE
            "action" NOT IN ('BLOCK', 'ALLOW')
          AND date >= '2023/04/18'
    ),
    result AS (
        SELECT
            to_iso8601(from_unixtime(timestamp / 1000)) as time_ISO_8601,
            terminatingruleid,
            labels,
            try(
                filter(
                    headers,
                    x -> LOWER(x.name) = 'user-agent'
                )[1].value
            ) AS UserAgent,
            try(
                filter(
                    headers,
                    x -> LOWER(x.name) = 'x-forwarded-for'
                )[1].value
            ) AS XForwardedForIP
        FROM targets
    )
SELECT
    *
FROM result
;

Upvotes: 2

Tyrone321
Tyrone321

Reputation: 1902

Here's a WAF query that should do the trick for request headers:

WITH waf_data AS (
  SELECT
    waf.action as action,
    waf.httprequest.clientip as clientip,
    waf.httprequest.country as country,
    map_agg(f.name, f.value) AS kv
  FROM "waf_logs" waf,
  UNNEST(waf.httprequest.headers) AS t(f)
  GROUP BY 1, 2, 3 
)
SELECT
  waf_data.action,
  waf_data.clientip,
  waf_data.country,
  waf_data.kv['Host'] AS host,
  waf_data.kv['User-Agent'] as UserAgent,
  waf_data.kv['Cookie'] as cookie
FROM waf_data
WHERE waf_data.kv['Host'] like 'waf_alb.us-east-2.elb.amazonaws.com'
LIMIT 10;

Upvotes: 2

Bertrand Martel
Bertrand Martel

Reputation: 45352

I used the following to extract the HTTP header values (here Host field by name):

SELECT action, header.value as hostname, clientip, timestamp
from (
    SELECT 
      httprequest.clientip as clientip, 
      action,
      timestamp, 
      httprequest.headers as headers 
    FROM waf_logs 
) 
cross join unnest(headers) as c(header)
where lower(header.name) = 'host'

I used this post to extract the array of row which is formatted like array(row("name" varchar,"value" varchar)) in the waf logs database

Upvotes: 0

Related Questions