Reputation: 3
In an Intro to Data Science course, the assignment is to find four regular expressions to match the IP address, the time, the user name, and the HTTP method from a file called "logdata" and build a list of dictionaries.
A typical line from the file looks like:
146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
Using re.findall(), the following regular expressions work:
re.findall('(?P<time>\[(.+?)\])', logdata)
re.findall('(?P<method>".*?")', logdata)
re.findall('(?P<host>\d+\.\d+\.\d+\.\d+)', logdata)
re.findall('(?P<user_name>-\s[\w-]*)', logdata)
but when putting them together for an iteration (using re.finditer()), such as:
for i in re.finditer('(?P<time>\[(.+?)\])(?P<host>(\d+\.\d+\.\d+\.\d+)(?P<tag>".*?")(?P<host>\d+\.\d+\.\d+\.\d+)', logdata):
    f = i.groupdict() #f is a list
it does not work.
Upvotes: 0
Views: 78
Reputation: 4637
The combined pattern contains two host groups, and a group name can only be defined once in a pattern. More importantly, if you want only one loop, the resulting regular expression should be a new one, reflecting the real structure of the access log: not only the parts that interest you but also what is in between.
Like this:
^(?P<host>\d+\.\d+\.\d+\.\d+)\s+(?P<user_name>-\s[\w-]*)\s+(?P<time>\[(.+?)\])\s+(?P<method>".*?")\s+\d+\s+\d+$
used with the re.M flag, so that ^ and $ match at the start and end of each line.
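As a rough sketch of how that single pattern could be applied in one loop, assuming logdata holds the whole file as one string: the first sample line below is the one from the question, while the second is invented purely to show that re.M lets the pattern match line by line.

```python
import re

# First line is from the question; the second is a made-up line in the
# same format, included only to demonstrate multi-line matching.
logdata = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '10.0.0.1 - alice [21/Jun/2019:15:45:25 -0700] "GET /index HTTP/1.1" 200 123\n'
)

# The combined pattern from this answer, compiled with re.M so that
# ^ and $ anchor to each line rather than to the whole string.
pattern = re.compile(
    r'^(?P<host>\d+\.\d+\.\d+\.\d+)\s+(?P<user_name>-\s[\w-]*)\s+'
    r'(?P<time>\[(.+?)\])\s+(?P<method>".*?")\s+\d+\s+\d+$',
    re.M,
)

# groupdict() returns only the named groups, one dict per matched line.
records = [m.groupdict() for m in pattern.finditer(logdata)]
for record in records:
    print(record)
```

With the groups defined this way, the captured values still include the leading "- " of the user name, the square brackets around the time, and the quotes around the request.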
Also, I think your regular expressions for the IP address, time, etc. are not specific enough for real usage. E.g. (?P<IP>\d{1,3}(\.\d{1,3}){3}) for the IP address is better, though still not perfect (it will allow 300.300.300.300, for example).
https://regex101.com/r/amAN54/1
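To close the 300.300.300.300 gap, one option is to restrict each octet to 0-255 inside the pattern itself. The following is only a sketch: the names OCTET and IP_PATTERN are made up for illustration, and an alternative would be to keep the simpler pattern and check the numeric range in Python after matching.

```python
import re

# Hypothetical helper names. Each octet is limited to the range 0-255:
# 250-255, 200-249, 100-199, or a one/two digit number without leading zero.
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
IP_PATTERN = re.compile(rf'(?P<IP>{OCTET}(?:\.{OCTET}){{3}})')

print(bool(IP_PATTERN.fullmatch('146.204.224.152')))  # True
print(bool(IP_PATTERN.fullmatch('300.300.300.300')))  # False
```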
Upvotes: 0
Reputation: 15936
When breaking up a log message like this using regex, I would typically recommend building up your regex one piece at a time, remembering to account for every character in a line, including separating spaces.
So, here is our log line:
146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
Now, let's start with the IP address:
# re.M makes ^ match at the start of every line of logdata, not just the first
for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+)', logdata, re.M):
    print(i.groupdict())
After the IP address, we see a space, a dash, and another space, followed by the username, so now we add that to our regex:
for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+)', logdata, re.M):
    print(i.groupdict())
Note a couple of things above: the - shouldn't be part of the captured user_name, and the user name should have one or more characters.
Third, we have another space between the username and the time, and you'll see we did the same thing here that we did with the - in front of the username, leaving the [] out of the captured group for the time:
for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+) \[(?P<time>(.+?))\]', logdata, re.M):
    print(i.groupdict())
And now for the method, it's the same thing: a space, then the method as the first part after the opening quote:
for i in re.finditer(r'^(?P<host>\d+\.\d+\.\d+\.\d+) \- (?P<user_name>[\w]+) \[(?P<time>(.+?))\] "(?P<method>POST|PATCH|HEAD|GET|DELETE|PUT|CONNECT|OPTIONS|TRACE)', logdata, re.M):
    print(i.groupdict())
I used a tighter expression since, technically, the method is only the first word of the request, but you can use your existing regex if you want to capture the route as well as the HTTP version.
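If the route and HTTP version are wanted as separate fields, one possible extension is sketched below. The group names route and protocol are hypothetical additions, not part of the assignment; logdata is assumed to hold the whole file as one string, which is why re.M is passed; and the second sample line is invented for illustration.

```python
import re

# First line is from the question; the second is a made-up line in the
# same format, included only to show that every line is picked up.
logdata = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '10.0.0.1 - alice [21/Jun/2019:15:45:25 -0700] "GET /index HTTP/1.1" 200 123\n'
)

# Built piece by piece as above, then extended with two hypothetical
# groups ("route" and "protocol") for the rest of the quoted request.
pattern = re.compile(
    r'^(?P<host>\d+\.\d+\.\d+\.\d+) - (?P<user_name>\w+) '
    r'\[(?P<time>.+?)\] '
    r'"(?P<method>POST|PATCH|HEAD|GET|DELETE|PUT|CONNECT|OPTIONS|TRACE) '
    r'(?P<route>\S+) (?P<protocol>[^"]+)"',
    re.M,  # let ^ match at the start of every line
)

# One dictionary per log line, keyed by group name.
entries = [m.groupdict() for m in pattern.finditer(logdata)]
for entry in entries:
    print(entry)
```

Dropping the route and protocol groups from the dictionaries, or from the pattern, gives back the four fields the assignment asks for.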
Upvotes: 2