Mete Kural

Reputation: 11

Optional fields when matching log file rows using regex

I'm trying to parse a web log with regular expressions using RegexSerDe. It maps each regex capture group to a column in a table, and if a group is empty it assigns NULL to that column.

I'm having trouble matching log rows with missing fields. There are two kinds of rows in this log:

<134>2016-10-23T23:59:59Z cache-iad2134 fastly[502801]: 52.55.94.131 "-" "-" Sun, 23 Oct 2016 23:59:59 GMT GET /apps/events/2016/10/11/3062653/?REC_ID=3062653&id=0 200

<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199

I wrote the below regex that matches the first type of row with all fields:

^(\\S+) (\\S+) (\\S+) (\\S+) "(\\S+)" "(\\S+)" (.*) (\\d{3})

I tried using ? to make the fields after the first 4 optional, but I kept shifting the columns out of place.

Any suggestions on how I should add the ? without changing the number of groups (so that the deserializer doesn't cough up)? Or any other way to do this you would suggest?

Upvotes: 1

Views: 355

Answers (1)

Barmar

Reputation: 780994

Put a non-capturing group around all the fields after the first 4, and make it optional.

^(\\S+) (\\S+) (\\S+) (\\S+)(?: "(\\S+)" "(\\S+)" (.*) (\\d{3}))?

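Putting ?: at the beginning of a group makes it non-capturing, so the wrapper itself doesn't change the number of captured groups. The four groups inside it are still numbered 5 to 8. On a row that has only the first 4 fields, the optional part is skipped and those groups don't take part in the match, so they should come back empty and end up as NULL in their columns.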
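To check the group numbering outside Hive, here is a minimal Python sketch that runs the same pattern against both sample rows from the question. It uses single backslashes, on the assumption that the doubled ones in the question are only Hive string escaping, and it uses a whole-line match on the assumption that RegexSerDe also requires the pattern to match the entire row. The names PATTERN, parse_row, FULL_ROW and SHORT_ROW are made up for this example.

```python
import re

# Same pattern as above, written as Python raw strings (single backslashes).
PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) (\S+)'               # groups 1-4: always present
    r'(?: "(\S+)" "(\S+)" (.*) (\d{3}))?'     # groups 5-8: optional as a unit
)

FULL_ROW = ('<134>2016-10-23T23:59:59Z cache-iad2134 fastly[502801]: '
            '52.55.94.131 "-" "-" Sun, 23 Oct 2016 23:59:59 GMT GET '
            '/apps/events/2016/10/11/3062653/?REC_ID=3062653&id=0 200')
SHORT_ROW = '<134>2016-10-23T23:59:59Z cache-dfw1835 fastly[502801]: 1477267199'


def parse_row(line):
    """Return the 8 captured groups as a list, or None if the row doesn't match."""
    m = PATTERN.fullmatch(line)  # whole-line match (assumed to mirror the SerDe)
    return list(m.groups()) if m else None


for row in (FULL_ROW, SHORT_ROW):
    groups = parse_row(row)
    # Both rows yield 8 entries; groups that did not participate are None.
    print(len(groups), groups)
```

Both rows produce a list of 8 entries, so the column count stays fixed. For the short row, entries 5 to 8 are None, which is Python's counterpart of the null a Java matcher returns for a group that did not participate.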

Upvotes: 1
