Reputation: 11
I'm trying to parse a web log with regular expressions using RegexSerDe. It maps each regex capture group to a column in a table, and if a group is empty it assigns a null to that column.
I'm having trouble matching log rows with missing fields. There are two kinds of rows in this log:
<134>2016-10-23T23:59:59Z cache-iad2134 fastly[502801]: 52.55.94.131 "-" "-" Sun, 23 Oct 2016 23:59:59 GMT GET /apps/events/2016/10/11/3062653/?REC_ID=3062653&id=0 200
<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199
I wrote the below regex that matches the first type of row with all fields:
^(\\S+) (\\S+) (\\S+) (\\S+) "(\\S+)" "(\\S+)" (.*) (\\d{3})
I played around with ? to get the regex to optionally ignore the fields after the first 4, but kept messing up the columns.
Any suggestions on how I should add the ? without changing the number of groups (so that the deserializer doesn't choke)? Or is there any other way to do this that you would suggest?
Upvotes: 1
Views: 355
Reputation: 780994
Put a non-capturing group around all the fields after the first 4, and make it optional.
^(\\S+) (\\S+) (\\S+) (\\S+)(?: "(\\S+)" "(\\S+)" (.*) (\\d{3}))?
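Putting ?: at the beginning of a group makes it non-capturing, so the outer group doesn't add to the number of captured groups: the pattern still has 8. When a row has only the first 4 fields, the optional part is skipped and groups 5 through 8 simply don't participate in the match, so they should come back as null.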
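If you want to sanity-check this outside Hive, here is a minimal sketch using Python's re module as a stand-in for the Java regex engine behind RegexSerDe (an assumption: this particular pattern uses no syntax that differs between the two). The backslashes are single here because it is a Python raw string; the doubled ones above are presumably only escaping for the table definition. fullmatch is used on the assumption that the whole line has to match. The names PATTERN, FULL_ROW, SHORT_ROW and parse are just for illustration.

```python
import re

# Same pattern as in the answer, written with single backslashes
# because this is a Python raw string (no extra escaping layer).
PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) (\S+)(?: "(\S+)" "(\S+)" (.*) (\d{3}))?'
)

# The two sample rows from the question.
FULL_ROW = (
    '<134>2016-10-23T23:59:59Z cache-iad2134 fastly[502801]: 52.55.94.131 '
    '"-" "-" Sun, 23 Oct 2016 23:59:59 GMT GET '
    '/apps/events/2016/10/11/3062653/?REC_ID=3062653&id=0 200'
)
SHORT_ROW = '<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199'


def parse(line):
    """Return the tuple of captured groups, or None if the line does not match.

    Groups that did not participate in the match are returned as None.
    """
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


if __name__ == "__main__":
    # Number of capturing groups; the (?: ... ) wrapper is not counted.
    print(PATTERN.groups)
    for row in (FULL_ROW, SHORT_ROW):
        print(parse(row))
```

Both rows match with the same group count, and for the short row the last four entries are None, which is the behaviour you want the deserializer to turn into nulls.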
Upvotes: 1