Haydee Garcia

Reputation: 3

Regular expressions do not work when combined

In an Intro to Data Science course, the assignment is to find four regular expressions to match the IP address, the time, the user name, and the HTTP method from a file called "logdata" and build a list of dictionaries.

A typical line from the file looks like:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622

Using re.findall(), the following regular expressions work:

re.findall('(?P<time>\[(.+?)\])', logdata)

re.findall('(?P<method>".*?")', logdata)

re.findall('(?P<host>\d+\.\d+\.\d+\.\d+)', logdata)

re.findall('(?P<user_name>-\s[\w-]*)', logdata)

but when putting them together for an iteration (using re.finditer()), such as:

for i in re.finditer('(?P<time>\[(.+?)\])(?P<host>(\d+\.\d+\.\d+\.\d+)(?P<tag>".*?")(?P<host>\d+\.\d+\.\d+\.\d+)', logdata):
    f = i.groupdict()  #f is a list

it does not work.

Upvotes: 0

Views: 78

Answers (2)

Alexander Mashin

Reputation: 4637

  1. You have two host groups,
  2. One closing parenthesis is missing.
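Both problems surface as soon as the pattern is compiled, before any matching happens. A minimal sketch (the variable names `broken` and `error_message` are only illustrative) that asks the `re` module to compile the combined expression from the question and prints whatever error it reports:

```python
import re

# The combined pattern from the question, unchanged.
broken = r'(?P<time>\[(.+?)\])(?P<host>(\d+\.\d+\.\d+\.\d+)(?P<tag>".*?")(?P<host>\d+\.\d+\.\d+\.\d+)'

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # The parser stops at the first problem it meets, so only one
    # of the two issues is reported per attempt.
    error_message = str(exc)

print(error_message)
```

Since only one error is reported at a time, fixing the first one and compiling again should expose the other.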

More importantly, if you want only one loop, the resulting regular expression should be a new one, reflecting the real structure of the access log: not only the parts that interest you but also what is in between.

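Like this: ^(?P<host>\d+\.\d+\.\d+\.\d+)\s+(?P<user_name>-\s[\w-]*)\s+(?P<time>\[(.+?)\])\s+(?P<method>".*?")\s+\d+\s+\d+$, compiled with re.M so that ^ and $ match at the start and end of every line rather than only at the start and end of the whole file.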
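As a rough sketch of how that single expression could produce the list of dictionaries the assignment asks for: the `logdata` string below is a stand-in for the real file (its second line is made up for illustration), and `records` is just an illustrative name.

```python
import re

# Stand-in for the contents of the log file; the second line is invented.
logdata = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '10.0.0.7 - alice [21/Jun/2019:15:45:25 -0700] "GET /index HTTP/1.1" 200 1234\n'
)

# The pattern from the answer, split over two lines for readability.
pattern = re.compile(
    r'^(?P<host>\d+\.\d+\.\d+\.\d+)\s+(?P<user_name>-\s[\w-]*)\s+'
    r'(?P<time>\[(.+?)\])\s+(?P<method>".*?")\s+\d+\s+\d+$',
    re.M,  # ^ and $ match at every line, not just once for the whole string
)

# groupdict() returns only the named groups, one dict per matched line.
records = [m.groupdict() for m in pattern.finditer(logdata)]
for record in records:
    print(record)
```

With the groups written this way, the captured values still include the leading "- " of the user name, the square brackets around the time, and the quotes around the request, so they would need trimming afterwards.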

Also, I think your regular expressions for the IP address, time, etc. are not specific enough for real usage. E.g. (?P<IP>\d{1,3}(\.\d{1,3}){3}) is better for the IP address, though still not perfect (it will allow 300.300.300.300, for example).
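One way to close that gap, sketched here under the assumption that a small helper after the regex match is acceptable, is to check the numeric range of each octet in plain Python instead of encoding it in the pattern. The names `IP_PATTERN` and `is_valid_ipv4` are hypothetical.

```python
import re

# Same shape as the pattern above, with a non-capturing repeat group.
IP_PATTERN = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}')

def is_valid_ipv4(text):
    """Return True if text is four dot-separated numbers, each in 0..255."""
    if not IP_PATTERN.fullmatch(text):
        return False
    return all(int(octet) <= 255 for octet in text.split('.'))

print(is_valid_ipv4('146.204.224.152'))
print(is_valid_ipv4('300.300.300.300'))
```

This keeps the regular expression readable; a pure-regex range check is possible too, but it is considerably longer.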

https://regex101.com/r/amAN54/1

Upvotes: 0

2ps

Reputation: 15936

When breaking up a log message like this using regex, I would typically recommend building up your regex one piece at a time, remembering to account for every character in a line, including the separating spaces.

So, here is our log line:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622

Now, let's start with the IP address:

for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+)', logdata, re.M):  # re.M lets ^ match at the start of every line
    print(i.groupdict())

After the IP address, we see a space, a hyphen, and another space, followed by the username, so now we add that to our regex:

for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+)', logdata, re.M):
    print(i.groupdict())

Note a couple of things above: the - shouldn't be part of the captured user_name, and the user name should have one or more characters.

Third, we have another space between the username and the time, and you'll see we do the same thing here that we did with the - in front of the username, leaving the [] out of the captured group for the time:

for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+) \[(?P<time>(.+?))\]', logdata, re.M):
    print(i.groupdict())

And now for the method, it's the same thing: a space, then the opening quote, with the method as the first word after the quote:

for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+) \[(?P<time>(.+?))\] "(?P<method>POST|PATCH|HEAD|GET|DELETE|PUT|CONNECT|OPTIONS|TRACE)', logdata, re.M):
    print(i.groupdict())

I used a tighter expression since, technically, only the first word of the request is the method, but you can use your existing regex if you want to capture the route and the HTTP version as well.
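Putting the pieces together, a sketch of the finished loop might look like the following. The `logdata` string stands in for the real file (its second line is invented), and the extra `request` group is a hypothetical addition showing one way to keep the route and HTTP version next to the tighter `method` group; it is not part of the pattern above.

```python
import re

# Stand-in for the contents of the log file; the second line is invented.
logdata = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '10.0.0.7 - alice [21/Jun/2019:15:45:25 -0700] "GET /index HTTP/1.1" 200 1234\n'
)

pattern = re.compile(
    r'^(?P<host>\d+\.\d+\.\d+\.\d+) - (?P<user_name>\w+) '
    r'\[(?P<time>.+?)\] '
    # "request" (hypothetical) wraps the whole quoted text; "method" is its first word.
    r'"(?P<request>(?P<method>POST|PATCH|HEAD|GET|DELETE|PUT|CONNECT|OPTIONS|TRACE) .*?)"',
    re.M,  # needed so ^ matches at the start of every line in logdata
)

# One dictionary per log line, keyed by the group names.
records = [m.groupdict() for m in pattern.finditer(logdata)]
for record in records:
    print(record)
```

Lines that do not fit the pattern are skipped silently, so comparing `len(records)` with the number of lines in the file is a quick way to spot entries the expression does not cover, such as user names containing characters outside `\w`.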

Upvotes: 2
