Keith
Keith

Reputation: 37

How to compile several regex into one

Good morning, I need to compile a several regular expressions into one pattern Regular expressions are like this:

reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>htt[p|ps]:.*?)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'

Regular expressions are written for strings from apache access.log:

109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374

Tried to compile it with code like this:

some_pattern =  re.compile(reg_ip.join(reg_meth).join(reg_status))

Obviously it doesn't work that way. How to do it right?

Upvotes: 0

Views: 103

Answers (1)

Julio
Julio

Reputation: 5308

You need some glue between regexes.

You have two options:

  • join regexes via alternation: regex1|regex2|regex3|... and use global search
  • add missing glue betweek regexes: for example, between reg_status and reg_url you may need to add r'[^"]+' to skip the next number

The problem with alternation is that you could find the regexes at any place. So you could find for example the word post (or a number) inside an url.

So for me, the second option is better.

This is the glue I would use:

import re

reg_ip = r'(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
reg_meth = r'(?P<METHOD>GET|POST|PUT|DELETE|HEAD)'
reg_status = r'\s(?P<STATUS>20[0-9]|30[0-9]|40[0-9]|50[0-9])\s'
#reg_400 = r'\s(?P<STATUS_400>40[0-9])\s'
#reg_500 = r'\s(?P<STATUS_500>50[0-9])\s'
reg_url = r'"(?P<URL>https?:[^"]+)"'
reg_rt = r'\s(?P<REQ_TIME>\d{4})$'

some_pattern =  re.compile(reg_meth + r'\s+[^]]+\s*"' + reg_status + r'[^"]+' + reg_url + r'\s*"[^"]+"\s*' + reg_rt)
print(some_pattern)

line = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494 "http://almhuette-raith.at/administrator/" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 4374'

print(some_pattern.search(line))

For the glue, these are the pieces I used:

\s*   : Capture any 'whitespace' 0 or more times
\s+   : Capture any 'whitespace' 1 or more times
[^X]+ : Where 'X' is some character; Capture any non-X characters one or more times

By the way:

This htt[p|ps] is not correct. You can simply use https? instead. Or if you want to do it with groups: htt(p|ps) or http(?:p|ps) (Last one is a non-capturing group, which is preferred if you dont want to capture its content)

Upvotes: 1

Related Questions