khinester

Reputation: 3530

regex does not respect double quotes

I am trying to split the following ELB log entry:

2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0

using this regex:

const splitElbEntry = (elbLogEntry) => R.match(/\S+|"[^"]*"/g)(elbLogEntry.trim())

but it does not seem to be working: https://regex101.com/r/JOlrxS/1

I would like to preserve anything inside double quotes as a single field, such as

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Upvotes: 2

Views: 53

Answers (1)

ctwheels

Reputation: 22837

Change the order of your options: Order matters.


Why is this happening?

The regex engine will attempt each option in the order you've presented. \S+|"[^"]*" will always attempt to match \S+ first. If \S+ fails to match at a given location in the string, the second option "[^"]*" is then attempted.
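The effect of the ordering is easier to see on a short input. Here is a minimal sketch, assuming a made-up string (`sample` and the variable names are illustrative and not taken from the log entry above):

```javascript
// Hypothetical short input: bare tokens around one quoted field that contains a space.
const sample = 'a "b c" d';

// \S+ is tried first and succeeds at the opening quote, so the quoted
// field is broken at its inner space.
const unquotedFirst = sample.match(/\S+|"[^"]*"/g);

// With the quoted alternative listed first, the whole quoted field is one match.
const quotedFirst = sample.match(/"[^"]*"|\S+/g);

console.log(unquotedFirst); // [ 'a', '"b', 'c"', 'd' ]
console.log(quotedFirst);   // [ 'a', '"b c"', 'd' ]
```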

Since \S also matches ", the first option succeeds at every position where a match can start, so the second option can never produce a match with your existing regex. You may as well reduce your existing regex to \S+. The snippets below show that \S+|"[^"]*" and \S+ yield the same results.

Your regex \S+|"[^"]*":

var s = `2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0`
console.log(s.match(/\S+|"[^"]*"/g))

Your regex simplified \S+:

var s = `2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0`
console.log(s.match(/\S+/g))


How do you fix this?

Changing the order of the options tells the regex engine to try "[^"]*" first, then, if that doesn't match, to try \S+.


"[^"]*"|\S+

var s = `2018-04-16T08:09:27.203Z cae70dd2-414c-11e8-836a-354cb4985a41 https 2018-04-15T01:20:31.092381Z app/MBM-L-Publi-V9D386A91UNR/4695f2e72859f540 128.121.50.133:59367 10.0.1.14:80 0.001 0.003 0.000 200 200 934 282 "GET https://www.domain.tld:443/__utm.gif?v=1&_v=j66&a=1866784098&t=pageview&_s=1&dl=https%3A%2F%2Fwww.domain.tld%2Fnews%2Farchived%2Fresources-archived%22001-11%2F&ul=en-us&de=UTF-8&dt=Racal%20reborn%20after%20Thales%20buyout&sd=24-bit&sr=412x732&vp=404x732&je=0&cid=1296878891.1495497600&_gid=1908154735.1495497600&_r=1&z=821631926 HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/MBM-L-Cache-1LH0DNU489D55/167e4810f75804c3 "Root=1-5ad2a8df-021aaad5031047e7dec3f2fa" "www.domain.tld" "arn:aws:acm:eu-west-2:123456789012:certificate/1140cbb2-4d4f-44b0-a4d9-a79329c5e361" 0`
console.log(s.match(/"[^"]*"|\S+/g))
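If the surrounding quotes themselves are not wanted in the result (the question shows the user-agent value without them), one possible variation is to capture the quoted content in a group and fall back to the whole match for bare tokens. This is a sketch rather than part of the fix above: the helper name `splitFields` and the short input are hypothetical, and it assumes fields never contain escaped double quotes.

```javascript
// Hypothetical helper: split on whitespace, keep quoted fields whole,
// and drop the surrounding double quotes.
const splitFields = (line) =>
  Array.from(line.trim().matchAll(/"([^"]*)"|\S+/g), (m) =>
    // m[1] holds the quoted content when the first alternative matched,
    // and is undefined when the bare-token alternative matched.
    m[1] !== undefined ? m[1] : m[0]
  );

console.log(splitFields('200 "GET /x HTTP/1.1" "Mozilla/5.0 (compatible)" 0'));
// [ '200', 'GET /x HTTP/1.1', 'Mozilla/5.0 (compatible)', '0' ]
```

String.prototype.matchAll requires a regex with the g flag and is available in Node.js 12 and later; in older environments a RegExp.prototype.exec loop would be needed instead.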

Upvotes: 5
