
Reputation: 39

Regular expression in Python

I am struggling with writing a regular expression in Python. For instance, I get the following case right:

"GET /images/launch-logo.gif HTTP/1.0" 220 1839

is matched by

"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)

However, I still need to handle the following cases with the same expression:

"GET /history/history.html hqpao/hqpao_home.html HTTP/1.0" 200 1502
"GET /shuttle/missions/missions.html Shuttle Launches from Kennedy Space Center HTTP/1.0"200 8677
"GET /finger @net.com HTTP/1.0"404 -

Obviously I should change the bold part of the expression

"(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)

But how should I change it? One approach I have in mind is to change the bold part to

[\s |(\s*)(\S+) |(\S+)(12) |(\S+)]

where the 2nd, 3rd, and 4th alternatives are the extra cases (1), (2), and (3) that I need to deal with.

But my expression does not work. What am I misunderstanding about regular expressions, given that I am simply handling it case by case?

Upvotes: 2

Reputation: 627488

You may use

^"([^\s"]+)\s+([^\s"]+)(?:\s+([^"]+?))?\s+([A-Z]+/\d[\d.]*)"\s*(\d{3})\s*(\S+)$

Details

^ - start of a line (use re.M if you are reading the whole file into a variable, f.read())
" - a double quotation mark
([^\s"]+) - Group 1: one or more chars other than whitespace and a double quotation mark
\s+ - 1+ whitespaces
([^\s"]+) - Group 2: one or more chars other than whitespace and a double quotation mark
(?:\s+([^"]+?))? - an optional non-capturing group matching
- \s+ - 1+ whitespaces
- ([^"]+?) - Group 3: any 1 or more chars other than ", as few as possible
\s+ - 1+ whitespaces
([A-Z]+/\d[\d.]*) - Group 4: 1+ uppercase letters, /, and then 1 digit followed by any 0+ digits or . chars
" - a double quotation mark
\s* - 0+ whitespaces
(\d{3}) - Group 5: three digits
\s* - 0+ whitespaces
(\S+) - Group 6: 1 or more non-whitespace chars
$ - end of string (or end of a line when re.M is used).
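
As a rough sketch of how the pattern above could be used from Python, the snippet below runs it over the four sample lines from the question. The names `LOG_RE`, `FIELDS`, `parse_line` and `parse_text`, as well as the field labels, are made up for this illustration and are not part of the answer.

```python
import re

# Pattern copied from the answer above. re.M lets ^ and $ anchor at every
# line, which matters when a whole file is scanned in one go.
LOG_RE = re.compile(
    r'^"([^\s"]+)\s+([^\s"]+)(?:\s+([^"]+?))?\s+([A-Z]+/\d[\d.]*)"\s*(\d{3})\s*(\S+)$',
    re.M,
)

# Hypothetical labels for Groups 1-6, chosen only for this sketch.
FIELDS = ("method", "path", "extra", "protocol", "status", "size")

# The four sample lines from the question.
SAMPLE_LINES = [
    '"GET /images/launch-logo.gif HTTP/1.0" 220 1839',
    '"GET /history/history.html hqpao/hqpao_home.html HTTP/1.0" 200 1502',
    '"GET /shuttle/missions/missions.html Shuttle Launches from Kennedy Space Center HTTP/1.0"200 8677',
    '"GET /finger @net.com HTTP/1.0"404 -',
]


def parse_line(line):
    """Return a dict of the six groups, or None if the line does not match."""
    m = LOG_RE.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None


def parse_text(text):
    """Scan a multi-line string (e.g. the result of f.read()) line by line."""
    return [dict(zip(FIELDS, m.groups())) for m in LOG_RE.finditer(text)]


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_line(line))
```

Group 3 (labelled `extra` here) comes back as `None` when the request has nothing between the path and the protocol, so callers should be ready for that.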

Upvotes: 0

Reputation: 1252

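This might be a bit messy, but it handles the original line and two of the three extra cases (it misses the hqpao/hqpao_home.html line, since / is not in the character class):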

\"(\S+) (\S+[\s\w\.\@]*)\s*(\S*)\"\s?(\d{3})\s(\S+)*

You can play with it on Regexr.
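
If you would rather check it locally, here is a small sketch that runs the pattern over the sample lines from the question and reports which ones match. The helper name `check` is made up for this illustration. As far as I can tell, the hqpao/hqpao_home.html line is the one that does not match, because `/` is not in the class `[\s\w\.\@]`.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r'\"(\S+) (\S+[\s\w\.\@]*)\s*(\S*)\"\s?(\d{3})\s(\S+)*')

# The four sample lines from the question.
SAMPLE_LINES = [
    '"GET /images/launch-logo.gif HTTP/1.0" 220 1839',
    '"GET /history/history.html hqpao/hqpao_home.html HTTP/1.0" 200 1502',
    '"GET /shuttle/missions/missions.html Shuttle Launches from Kennedy Space Center HTTP/1.0"200 8677',
    '"GET /finger @net.com HTTP/1.0"404 -',
]


def check(lines):
    """Map each line to True/False depending on whether PATTERN matches it."""
    return {line: PATTERN.match(line) is not None for line in lines}


if __name__ == "__main__":
    for line, ok in check(SAMPLE_LINES).items():
        print("match   " if ok else "NO MATCH", line)
```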

Upvotes: 1