Reputation: 37
Good morning. I need to combine several regular expressions into one pattern. The regular expressions look like this:
reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>htt[p|ps]:.*?)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
The regular expressions are written for lines from an Apache access.log:
109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374
I tried to compile them with code like this:
some_pattern = re.compile(reg_ip.join(reg_meth).join(reg_status))
Obviously it doesn't work that way. What is the right way to do it?
Upvotes: 0
Views: 103
Reputation: 5308
You need some glue between regexes.
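A quick look at why the join attempt from the question fails: str.join inserts the string it is called on between the characters of its argument, so it scatters one regex through the other instead of placing them side by side. The short regexes reg_a and reg_b below are placeholders chosen for readability, not the ones from the question.

```python
import re

# Placeholder regexes, kept short so the result is easy to read.
reg_a = r'\d+'
reg_b = r'GET|POST'

# join() puts reg_a between every character of reg_b.
joined = reg_a.join(reg_b)
print(joined)

# Plain concatenation with an explicit glue regex keeps both pieces intact.
glued = reg_a + r'\s+(?:' + reg_b + ')'
print(glued)
print(re.search(glued, '42 POST'))
```

The non-capturing group around reg_b is there so the alternation does not swallow the glue on either side.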
You have two options:
1. Join the regexes with alternation, regex1|regex2|regex3|..., and use a global search.
2. Put glue between the regexes, for example r'[^"]+' to skip the next number.
The problem with alternation is that each regex can match anywhere in the line. For example, you could find the word POST (or a number) inside a URL.
So for me, the second option is better.
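To illustrate the alternation problem, here is a minimal sketch that joins two of the regexes from the question with | and scans a made-up log fragment. The fragment is an assumption for the demo: its request path deliberately contains the word POST.

```python
import re

reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'

# Alternation: either regex may match at any position in the line.
alternation = re.compile(reg_meth + '|' + reg_status)

# Hypothetical fragment: the real method is GET, but the path contains "POST".
fragment = '"GET /POST/index.php HTTP/1.1" 200 4494'

# Collect (group name, matched text) for every hit of the global search.
hits = [(m.lastgroup, m.group(m.lastgroup)) for m in alternation.finditer(fragment)]
print(hits)
```

The search reports METHOD twice, once for the real method and once for the path segment, and nothing in the result says which one is which.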
This is the glue I would use:
import re
reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
#reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
#reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>https?:[^"]+)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'
some_pattern = re.compile(reg_meth + r'\s+[^]]+\s*"' + reg_status + r'[^"]+' + reg_url + r'\s*"[^"]+"\s*' + reg_rt)
print(some_pattern)
line = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374'
print(some_pattern.search(line))
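The pattern above starts at the method and leaves reg_ip out. If the IP is wanted as well, one possible variation is sketched below. The glue pieces are my own assumption about how to skip the fields in between, and they only fit lines shaped like the sample; groupdict() then returns all named groups at once.

```python
import re

reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
reg_url = r'"(?P<URL>https?:[^"]+)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'

full_pattern = re.compile(
    reg_ip
    + r'[^"]+"'      # skip identity, user and [timestamp] up to the request's opening quote
    + reg_meth
    + r'\s+[^"]+"'   # skip path and protocol up to the request's closing quote
    + reg_status
    + r'[^"]+'       # skip the response size
    + reg_url
    + r'\s*"[^"]+"'  # skip the quoted user agent
    + reg_rt
)

line = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374'

match = full_pattern.search(line)
fields = match.groupdict() if match else {}
print(fields)
```

Lines that deviate from this layout, for example a referrer logged as "-" or a request time that is not exactly four digits, will not match and yield an empty dict here.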
For the glue, these are the pieces I used:
\s* : Match any whitespace 0 or more times
\s+ : Match any whitespace 1 or more times
[^X]+ : Where 'X' is some character; match any non-X characters one or more times
By the way:
htt[p|ps] is not correct. Square brackets define a character class, so [p|ps] matches exactly one character: p, | or s. You can simply use https?
instead. Or, if you want to do it with groups: htt(p|ps)
or htt(?:p|ps)
(The last one is a non-capturing group, which is preferred if you don't want to capture its content.)
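A small check of the difference, as a sketch: inside square brackets the | is a literal character, so the bracket version accepts odd strings and rejects https. The test strings are made up for the demo.

```python
import re

wrong = re.compile(r'htt[p|ps]:')   # character class: one of 'p', '|', 's'
right = re.compile(r'https?:')      # 'http' followed by an optional 's'

for text in ('http:', 'https:', 'htt|:', 'htts:'):
    # bool() turns the match object (or None) into True/False for printing.
    print(text, bool(wrong.match(text)), bool(right.match(text)))
```

The bracket version fails on 'https:' because after consuming a single p it expects the colon right away.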
Upvotes: 1