Reputation: 39
I am struggling with writing regular expressions in Python. For instance, I get the following right:
"GET /images/launch-logo.gif HTTP/1.0" 220 1839
is matched by
"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)
However, I still need to handle all of the following cases together:
"GET /history/history.html hqpao/hqpao_home.html
HTTP/1.0" 200 1502
"GET /shuttle/missions/missions.html Shuttle Launches from
Kennedy Space Center HTTP/1.0"200 8677
"GET /finger @net.com HTTP/1.0"404 -
Obviously, I should change the bold part of the expression
"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)
But how should I change it? One approach I have in mind is to change the bold part to
[\s |(\s*)(\S+) |(\S+)(12) |(\S+)]
where the 2nd, 3rd, and 4th alternatives correspond to the extra cases (1), (2), and (3) I need to handle.
But my expression does not work. What do I misunderstand about regular expressions, given that I am simply handling it case by case?
Upvotes: 2
Views: 101
Reputation: 627488
You may use
^"([^\s"]+)\s+([^\s"]+)(?:\s+([^"]+?))?\s+([A-Z]+/\d[\d.]*)"\s*(\d{3})\s*(\S+)$
See the regex demo
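As a quick check, here is a minimal Python sketch that runs the pattern above over the sample lines from the question. It assumes the line breaks inside the quoted requests in the question are display wrapping, so each log entry is written on a single line here; the names LOG_RE, SAMPLE, and rows are only illustrative.

```python
import re

# Pattern copied from the answer above. re.M makes ^ and $ match at
# every line boundary, which matters when the whole log is one string.
LOG_RE = re.compile(
    r'^"([^\s"]+)\s+([^\s"]+)(?:\s+([^"]+?))?\s+([A-Z]+/\d[\d.]*)"\s*(\d{3})\s*(\S+)$',
    re.M,
)

# Sample entries from the question, one per line (assumed unwrapped).
SAMPLE = "\n".join([
    '"GET /images/launch-logo.gif HTTP/1.0" 220 1839',
    '"GET /history/history.html hqpao/hqpao_home.html HTTP/1.0" 200 1502',
    '"GET /shuttle/missions/missions.html Shuttle Launches from Kennedy Space Center HTTP/1.0"200 8677',
    '"GET /finger @net.com HTTP/1.0"404 -',
])

# Each match yields six groups: method, path, optional extra text,
# protocol, status code, and size (or "-").
rows = [m.groups() for m in LOG_RE.finditer(SAMPLE)]
for row in rows:
    print(row)
```

Group 3 comes back as None when the request has nothing between the path and the protocol, so code consuming the tuples should allow for that.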
Details
^ - start of a line (use re.M if you are reading the whole file into a variable with f.read())
" - a double quotation mark
([^\s"]+) - Group 1: one or more chars other than whitespace and a double quotation mark
\s+ - 1+ whitespaces
([^\s"]+) - Group 2: one or more chars other than whitespace and a double quotation mark
(?:\s+([^"]+?))? - an optional non-capturing group matching
  \s+ - 1+ whitespaces
  ([^"]+?) - Group 3: any 1 or more chars other than ", as few as possible
\s+ - 1+ whitespaces
([A-Z]+/\d[\d.]*) - Group 4: 1+ uppercase letters, / and then 1 digit followed by any 0+ digits or . chars
" - a double quotation mark
\s* - 0+ whitespaces
(\d{3}) - Group 5: three digits
\s* - 0+ whitespaces
(\S+) - Group 6: 1 or more non-whitespace chars
$ - end of string (or end of a line when re.M is used).
Upvotes: 0
Reputation: 1252
This might be a bit messy, but it works:
\"(\S+) (\S+[\s\w\.\@]*)\s*(\S*)\"\s?(\d{3})\s(\S+)*
You can play with it on Regexr (shared link).
Upvotes: 1