Reputation: 972
Problem: Running the crawler with a classifier that has the right grok pattern doesn't create a table with columns. Instead, a table with 0 columns and recordCount 0 is created (but objectCount is 5).
Details: I set up a Glue crawler to look at an S3 bucket that holds S3 access logs. The crawler uses a classifier to classify columns for each entry in the log file.
The classifier is set up with the grok pattern below:
%{NOTSPACE:session_uuid} %{NOTSPACE:bucket_name} \[%{DATA:timestamp}\] %{IP:ip_address} %{NOTSPACE:principle} %{NOTSPACE:request_uuid} %{NOTSPACE:bucket_action} %{NOTSPACE:resource} \"%{DATA:resource_action}\" %{NOTSPACE:http_status} %{NOTSPACE:http_error_msg} %{NOTSPACE:unknown1} %{NOTSPACE:unknown2} %{NOTSPACE:unknown3} %{NOTSPACE:unknown4} %{NOTSPACE:url} %{NOTSPACE:client_info} %{GREEDYDATA:rest}
This grok pattern successfully matches S3 access log lines like the ones below when I tested it with an online grok tester:
efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b my-s3bucket-12345 [12/Dec/2017:13:55:33 +0000] 123.123.123.123 - 2F834DCEE973FF7B REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 400 AuthorizationHeaderMalformed 365 - 6 - "-" "AWSConfig" -
efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b my-s3bucket-12345 [12/Dec/2017:14:32:29 +0000] 123.123.123.123 arn:aws:sts::1234567890:assumed-role/DataAccessRole 2F834DCEE973FF7B REST.GET.ACL - "GET /information-prefix/?acl HTTP/1.1" 200 - 622 - 237 - "-" "S3Console/0.4" -
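For anyone who wants to repeat that check offline, here is a rough Python sketch that expands the grok pattern above into a plain regex and runs it against the two sample lines. The primitive definitions are simplified stand-ins I wrote for illustration, not grok's real ones (in particular, IP is reduced to a loose IPv4 matcher), so a match here only shows the field layout lines up. It does not reproduce what the Glue classifier actually evaluates.

```python
import re

# Simplified regex stand-ins for the grok primitives used in the pattern.
# Assumption: IP is a loose IPv4 matcher here; grok's real IP definition
# also accepts IPv6, so this is not identical to what Glue runs.
PRIMITIVES = {
    "NOTSPACE": r"\S+",
    "DATA": r".*?",
    "GREEDYDATA": r".*",
    "IP": r"(?:\d{1,3}\.){3}\d{1,3}",
}

GROK_PATTERN = (
    r'%{NOTSPACE:session_uuid} %{NOTSPACE:bucket_name} \[%{DATA:timestamp}\] '
    r'%{IP:ip_address} %{NOTSPACE:principle} %{NOTSPACE:request_uuid} '
    r'%{NOTSPACE:bucket_action} %{NOTSPACE:resource} \"%{DATA:resource_action}\" '
    r'%{NOTSPACE:http_status} %{NOTSPACE:http_error_msg} %{NOTSPACE:unknown1} '
    r'%{NOTSPACE:unknown2} %{NOTSPACE:unknown3} %{NOTSPACE:unknown4} '
    r'%{NOTSPACE:url} %{NOTSPACE:client_info} %{GREEDYDATA:rest}'
)

# The two sample log lines from the question.
LINES = [
    'efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b '
    'my-s3bucket-12345 [12/Dec/2017:13:55:33 +0000] 123.123.123.123 - '
    '2F834DCEE973FF7B REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 400 '
    'AuthorizationHeaderMalformed 365 - 6 - "-" "AWSConfig" -',
    'efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b '
    'my-s3bucket-12345 [12/Dec/2017:14:32:29 +0000] 123.123.123.123 '
    'arn:aws:sts::1234567890:assumed-role/DataAccessRole 2F834DCEE973FF7B '
    'REST.GET.ACL - "GET /information-prefix/?acl HTTP/1.1" 200 - 622 - 237 - '
    '"-" "S3Console/0.4" -',
]


def grok_to_regex(pattern):
    """Replace each %{TYPE:name} token with a named regex group."""
    def expand(match):
        kind, name = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (name, PRIMITIVES[kind])
    return re.compile("^" + re.sub(r"%\{(\w+):(\w+)\}", expand, pattern) + "$")


REGEX = grok_to_regex(GROK_PATTERN)

for line in LINES:
    m = REGEX.match(line)
    print(m.groupdict() if m else "no match")
```

If both lines print a dictionary of fields, the column layout of the pattern is at least consistent with the samples, which points the investigation toward how Glue interprets the pattern rather than toward the field order.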
Upvotes: 0
Views: 1097
Reputation: 31
The grok pattern in the original question was helpful enough for me to get started setting up my own crawler. However, it is definitely incomplete.
Using the documented Amazon S3 server access log format, I created this pattern, which I believe is complete. Enjoy!
%{NOTSPACE:bucket_owner} %{NOTSPACE:bucket} \[%{DATA:time}\] %{NOTSPACE:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} \"%{DATA:resource_uri}\" %{NOTSPACE:http_status} %{NOTSPACE:error_code} %{NOTSPACE:bytes_sent} %{NOTSPACE:object_size} %{NOTSPACE:total_time} %{NOTSPACE:turn_around_time} \"%{NOTSPACE:referer}\" \"%{DATA:user_agent}\" %{NOTSPACE:version_id} %{NOTSPACE:host_id} %{NOTSPACE:signature_version} %{NOTSPACE:cipher_suite} %{NOTSPACE:authentication_type} %{NOTSPACE:host_header} %{NOTSPACE:tls_version} %{NOTSPACE:access_point_arn}
Note that many of these fields can be null, and Amazon writes a hyphen (-) in place of those values. Also note that some of the values are quoted.
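Since the hyphen placeholder ends up in the table as a literal string, it may help to normalize it after the fact. Below is a minimal Python sketch, assuming the fields have already been extracted into a dictionary keyed by the names in the pattern above. The sample values are taken from the first log line in the question and mapped onto those names by my own reading of the format. The helper name and the choice of which fields to treat as numeric are illustrative assumptions.

```python
# Fields treated as integers in this sketch (an assumption, not a Glue setting).
NUMERIC_FIELDS = (
    "http_status",
    "bytes_sent",
    "object_size",
    "total_time",
    "turn_around_time",
)


def normalize(fields):
    """Map the '-' placeholder to None and cast numeric fields to int."""
    cleaned = {}
    for name, value in fields.items():
        if value == "-":
            cleaned[name] = None
        elif name in NUMERIC_FIELDS:
            cleaned[name] = int(value)
        else:
            cleaned[name] = value
    return cleaned


# Values from the first sample line in the question, keyed by this
# answer's field names (the mapping is my own reading of the log format).
sample = {
    "bucket": "my-s3bucket-12345",
    "requester": "-",
    "operation": "REST.HEAD.BUCKET",
    "key": "-",
    "http_status": "400",
    "error_code": "AuthorizationHeaderMalformed",
    "bytes_sent": "365",
    "object_size": "-",
    "total_time": "6",
    "turn_around_time": "-",
    "referer": "-",
    "user_agent": "AWSConfig",
    "version_id": "-",
}

print(normalize(sample))
```

Because the quotes sit outside the capture groups in the pattern, quoted fields such as referer and user_agent arrive without their surrounding quotes, so a null referer shows up as a bare hyphen and is handled by the same check.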
Upvotes: 1
Reputation: 1525
Hope it's not too late. I think the "IP" pattern is causing the problem for you, since it also creates an UNWANTED portion. Just use IPV4 instead of IP, or you can use NOTSPACE.
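To make the swap concrete, here is a small Python sketch that rewrites the question's pattern both ways. I have not confirmed that this fixes the crawler, so treat it as something to try. The update_grok_classifier helper and the classifier name you pass to it are hypothetical. It relies on boto3's Glue update_classifier call and is never invoked here, so the snippet runs without AWS credentials.

```python
# The grok pattern from the question, unchanged.
ORIGINAL_PATTERN = (
    r'%{NOTSPACE:session_uuid} %{NOTSPACE:bucket_name} \[%{DATA:timestamp}\] '
    r'%{IP:ip_address} %{NOTSPACE:principle} %{NOTSPACE:request_uuid} '
    r'%{NOTSPACE:bucket_action} %{NOTSPACE:resource} \"%{DATA:resource_action}\" '
    r'%{NOTSPACE:http_status} %{NOTSPACE:http_error_msg} %{NOTSPACE:unknown1} '
    r'%{NOTSPACE:unknown2} %{NOTSPACE:unknown3} %{NOTSPACE:unknown4} '
    r'%{NOTSPACE:url} %{NOTSPACE:client_info} %{GREEDYDATA:rest}'
)


def swap_ip(pattern, replacement):
    """Replace the IP primitive with another grok primitive, keeping the field name."""
    return pattern.replace("%{IP:", "%{" + replacement + ":")


IPV4_PATTERN = swap_ip(ORIGINAL_PATTERN, "IPV4")
NOTSPACE_PATTERN = swap_ip(ORIGINAL_PATTERN, "NOTSPACE")


def update_grok_classifier(classifier_name, grok_pattern):
    """Hypothetical helper: push a new pattern to an existing Glue grok classifier."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    glue = boto3.client("glue")
    glue.update_classifier(
        GrokClassifier={"Name": classifier_name, "GrokPattern": grok_pattern}
    )


print(IPV4_PATTERN)
print(NOTSPACE_PATTERN)
```

After changing the classifier you would still need to rerun the crawler before the table schema can change.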
Upvotes: 0