Regular expressions do not work when combined

Question

In an Intro to Data Science course, the assignment is to find four regular expressions to match the IP address, the time, the user name, and the HTTP method from a file called "logdata" and build a list of dictionaries.

A typical line from the file looks like:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622

Using re.findall(), the following regular expressions work:

re.findall('(?P<time>\[(.+?)\])', logdata)

re.findall('(?P<request>".*?")', logdata)

re.findall('(?P<ip>\d+\.\d+\.\d+\.\d+)', logdata)

re.findall('(?P<user_name>-\s[\w-]*)', logdata)

but when putting them together for an iteration (using re.finditer()), such as:

for i in re.finditer('(?P<time>\[(.+?)\])(?P<user_name>(\d+\.\d+\.\d+\.\d+)(?P<request>".*?")(?P<ip>\d+\.\d+\.\d+\.\d+)', logdata):
    f = i.groupdict()  #f is a list

it does not work.

2ps · Accepted Answer

When breaking up a log message like this using regex, I would typically recommend building up your regex one piece at a time, remembering to account for every character in a line, including the separating spaces.

So, here is our log line:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
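To illustrate why every character has to be accounted for, here is a minimal sketch using that single line. Two patterns that each match on their own stop matching once they are glued together with nothing in between, because the combined regex then requires the second field to start immediately where the first one ends. The group names ip and time and the variable names are assumptions chosen for illustration.

```python
import re

line = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

# Two patterns that each match somewhere in the line on their own.
ip_pat = r'(?P<ip>\d+\.\d+\.\d+\.\d+)'
time_pat = r'\[(?P<time>.+?)\]'

# Glued together, the regex demands that "[" follow the IP address directly,
# which never happens in this line, so there is no match.
glued = re.search(ip_pat + time_pat, line)

# Spelling out the text between the two fields lets the match succeed.
spaced = re.search(ip_pat + r' - \w+ ' + time_pat, line)

print(glued)               # None
print(spaced.groupdict())  # {'ip': '146.204.224.152', 'time': '21/Jun/2019:15:45:24 -0700'}
```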

Now, let's start with the IP address:

# re.MULTILINE lets ^ match at the start of every line, not only the first one
for i in re.finditer(r'^(?P<ip>\d+\.\d+\.\d+\.\d+)', logdata, re.MULTILINE):
    print(i.groupdict())

After the IP address, we see a space, a hyphen, and another space, followed by the username, so now we add that to our regex:

for i in re.finditer(r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+)', logdata, re.MULTILINE):
    print(i.groupdict())

Note a couple of things above: the - shouldn't be part of the captured user_name, and the user name should have one or more characters.
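As a small sketch of that difference, the snippet below compares a pattern with the hyphen inside the named group against one with the hyphen outside it, using the sample line from above. The variable names are illustrative only.

```python
import re

line = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

# Hyphen inside the group: the separator becomes part of the captured value.
inside = re.search(r'(?P<user_name>-\s[\w-]*)', line)

# Hyphen outside the group: it must still be present, but it is not captured.
outside = re.search(r' \- (?P<user_name>[\w]+)', line)

print(repr(inside.group('user_name')))   # '- feest6811'
print(repr(outside.group('user_name')))  # 'feest6811'
```

One caveat, stated as an assumption about the data: if some lines use a bare - in place of a missing user name, [\w]+ will not match them, and the character class would need to allow a hyphen as well.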

Third, we have another space between the username and the time, and you'll see we did the same thing here that we did with the - in front of the username, leaving the [] out of the captured group for the time:

for i in re.finditer(r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+) \[(?P<time>(.+?))\]', logdata, re.MULTILINE):
    print(i.groupdict())

And now for the method, it's the same thing: a space, then the opening quote, with the method as the first part after the quote:

for i in re.finditer(r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+) \[(?P<time>(.+?))\] "(?P<method>POST|PATCH|HEAD|GET|DELETE|PUT|CONNECT|OPTIONS|TRACE)', logdata, re.MULTILINE):
    print(i.groupdict())

I used a tighter expression because, strictly speaking, the method is only the first word of the request, but you can use your existing regex if you want to capture the route and the HTTP version as well.
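Putting the pieces together, the following is a rough sketch of building the list of dictionaries, here capturing the whole quoted request rather than only the method. The second log line is made up for illustration, and the group names ip, time, and request are assumptions rather than names required by the assignment. It also shows that when logdata holds several lines, ^ only matches at the start of each line if re.MULTILINE is passed.

```python
import re

# Sample data: the first line is from the question, the second is invented.
logdata = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '10.0.0.7 - example_user [21/Jun/2019:15:45:25 -0700] "GET /index HTTP/1.1" 200 512\n'
)

# Adjacent string literals are concatenated into one pattern.
regex = (
    r'^(?P<ip>\d+\.\d+\.\d+\.\d+)'  # IP address at the start of a line
    r' - (?P<user_name>\w+)'        # " - " separator, then the user name
    r' \[(?P<time>.+?)\]'           # time, without the surrounding brackets
    r' "(?P<request>.*?)"'          # whole request, without the quotes
)

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_only = [m.groupdict() for m in re.finditer(regex, logdata)]

# With re.MULTILINE, ^ matches at the start of every line.
records = [m.groupdict() for m in re.finditer(regex, logdata, re.MULTILINE)]

print(len(first_only))  # 1
print(len(records))     # 2
print(records[0])
```

Each element of records is a plain dict keyed by group name, so records[0]['user_name'] is 'feest6811' for the sample line.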
