Reputation: 16199
The following regex:
(?:X-)?Received: (?:by|from) ([^ \n]+)
will, for the following lines, match the text in bold:
Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03
Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;
Received: from localhost (localhost [127.0.0.1])
If I alter the text so that the leading "Received: " is removed from each line, leaving me with:
from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03
by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;
from localhost (localhost [127.0.0.1])
How do I then update the regex to match just the IP addresses or domains (e.g. mail2.oknotify2.com, 10.66.156.198) in this text?
I can reduce it to (?:by|from) ([^ \n]+)
and that will give me "from mail2.oknotify2.com", "by 10.66.156.198" etc. But how do I take the last step and omit the "by " and "from ", leaving only the domain/IP address? The final regex should also, like the original, ignore subsequent domains/IPs on a line where present, e.g. mx.google.com in the first line.
Upvotes: 1
Views: 156
Reputation: 4259
You can use \K to drop everything matched so far from the reported match:
(?:X-)?Received: (?:by|from) \K([\S]+)
EDIT:
As @James Newton said, however, \K is not supported by all regex flavors. You can refer to this post to see whether your engine supports it:
https://stackoverflow.com/a/13543042/3393095
EDIT 2:
Since you specified Python, just using a capturing group and re.findall on your regex will do, like this:
>>> import re
>>> text = ("Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n"
... "Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n"
... "Received: from localhost (localhost [127.0.0.1])")
>>> re.findall(r'(?:X-)?Received: (?:by|from) ([\S]+)', text)
['mail2.oknotify2.com', '10.66.156.198', 'localhost']
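The same capturing-group approach can be sketched so that it covers both the original headers and the stripped text from the question. This is only an illustration under some assumptions: the helper name first_hops is hypothetical, the sample headers are shortened, and the second line is changed to "X-Received:" to exercise the optional prefix.

```python
import re

# Hypothetical helper, not part of the original answer.
# The "Received: " / "X-Received: " prefix is optional, and ^ with re.MULTILINE
# anchors each match at the start of a line, so a later "by mx.google.com"
# on the same line is not picked up.
HOP = re.compile(r'^(?:(?:X-)?Received: )?(?:by|from) (\S+)', re.MULTILINE)

def first_hops(text):
    """Return the first host or IP address after "by"/"from" on each line."""
    return HOP.findall(text)

# Shortened sample headers; the X- prefix on the second line is an assumption
# made for illustration.
with_prefix = (
    "Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP\n"
    "X-Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n"
    "Received: from localhost (localhost [127.0.0.1])"
)
# The same text with the header names stripped, as in the question.
without_prefix = with_prefix.replace("X-Received: ", "").replace("Received: ", "")

print(first_hops(with_prefix))
print(first_hops(without_prefix))
```

Because the pattern has exactly one capturing group, re.findall returns only that group, so both calls should print the same three hosts.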
Upvotes: 2
Reputation: 626927
You can use the re.MULTILINE flag so that ^ matches at the start of every line, not only at the start of the string. To get the necessary text, you will have to use a capturing group.
It is a pity that Python regex does not support \K, nor variable-width look-behind (with the native re library). However, a variable-width look-behind can be used with the external regex library.
Here is some sample code that you can use (Python 3 syntax):
import re
# ^ with re.MULTILINE anchors at the start of each line; (\S+) captures the host or IP.
p = re.compile(r'^(?:by|from) (\S+)', re.MULTILINE)
test_str = "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n\nby 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n\nfrom localhost (localhost [127.0.0.1])"
print([m.group(1) for m in p.finditer(test_str)])
Output of a demo program:
['mail2.oknotify2.com', '10.66.156.198', 'localhost']
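If the match itself should contain only the host, without a capturing group, the fixed-width restriction of the native re library can be worked around with two separate look-behinds. This is a minimal sketch, not the answerer's code; the name HOST and the shortened sample text are assumptions made for illustration.

```python
import re

# Python's re accepts a look-behind only if it has one fixed width, so
# (?<=^by |^from ) is rejected. Two separate look-behinds, each of fixed
# width, are allowed and can be combined with alternation.
HOST = re.compile(r'(?:(?<=^from )|(?<=^by ))\S+', re.MULTILINE)

# Shortened version of the stripped text from the question.
test_str = (
    "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP\n"
    "by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n"
    "from localhost (localhost [127.0.0.1])"
)

# No capturing group here, so findall returns the whole match for each hit.
hosts = HOST.findall(test_str)
print(hosts)
```

The ^ inside each look-behind keeps the match tied to the start of a line, which is why "mx.google.com" after the mid-line "by " is skipped. The capturing-group version is arguably easier to read; this variant only matters when the calling code needs the whole match to be the host.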
Upvotes: 1
Reputation: 7092
I'm writing an answer because a comment does not allow for formatting, but the correct answer is given by @stribizhev.
@stribizhev proposed this regex:
^(?:by|from) (\S+)
The ?: at the beginning of (?:by|from) makes it a non-capturing group. (\S+) is a capturing group. If you use result = string.match(regex), and there is a match, then result will contain an array such as ["from mail2.oknotify2.com", "mail2.oknotify2.com"]. The value of result[1] is the captured group.
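Since the question is about Python, the same distinction between the whole match and the captured group can be sketched with re.search. The variable names and the shortened sample line are assumptions made for illustration.

```python
import re

# Shortened first line of the stripped text from the question.
line = "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP"

# (?:by|from) is non-capturing, so (\S+) is group 1.
m = re.search(r'^(?:by|from) (\S+)', line)

if m:
    whole = m.group(0)  # the entire match, including "from "
    host = m.group(1)   # only the captured group
    print(whole)
    print(host)
```

Here m.group(0) plays the role of result[0] and m.group(1) the role of result[1] in the array described above.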
Upvotes: 1