Pyderman

Reputation: 16199

Regex: How to leave out string + space before a domain or IP address?

The following regex:

(?:X-)?Received: (?:by|from) ([^ \n]+)

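will, for the following lines, match the leading portion of each line up to and including the first domain or IP address (for example, "Received: from mail2.oknotify2.com"):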

Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03

Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;

Received: from localhost (localhost [127.0.0.1])

If I alter the text such that "Received: by " and "Received: from " are removed in each line, leaving me with:

from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03

by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;

from localhost (localhost [127.0.0.1])

How do I update the regex so that it matches only the IP addresses or domains (e.g. mail2.oknotify2.com, 10.66.156.198) in this text?

I can reduce it to (?:by|from) ([^ \n]+) and that will give me "from mail2.oknotify2.com", "by 10.66.156.198" etc. But how do I take the last step and omit the "by " and "from ", leaving only the domain/IP address? Like the original, the final regex should also ignore any subsequent domains/IPs on a line, e.g. mx.google.com in the first line.

Upvotes: 1

Views: 156

Answers (3)

Rodrigo López

Reputation: 4259

You can use \K to drop everything matched before it from the reported match:

(?:X-)?Received: (?:by|from) \K([\S]+)


EDIT:

As @James Newton pointed out, \K is not supported by all regex flavors. Refer to this post to see whether your engine supports it:

https://stackoverflow.com/a/13543042/3393095

EDIT 2:

Since you specified Python, simply using a capturing group with re.findall on your regex will do:

>>> import re
>>> text = ("Received: from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n"
... "Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n"
... "Received: from localhost (localhost [127.0.0.1])")
>>> re.findall(r'(?:X-)?Received: (?:by|from) ([\S]+)', text)
['mail2.oknotify2.com', '10.66.156.198', 'localhost']
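
This works because re.findall returns the contents of the capturing group, rather than the whole match, whenever the pattern contains exactly one group. A minimal sketch of the contrast, using one line from the question (the variable names are illustrative only):

```python
import re

line = "Received: by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;"

# No capturing group: findall returns the full text of each match.
whole = re.findall(r'(?:X-)?Received: (?:by|from) \S+', line)

# One capturing group: findall returns only what that group captured.
host_only = re.findall(r'(?:X-)?Received: (?:by|from) (\S+)', line)

print(whole)
print(host_only)
```

The non-capturing (?:...) groups do not count here, which is why only the host is returned in the second call.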

Upvotes: 2

Wiktor Stribiżew

Reputation: 626927

You can use the re.MULTILINE flag so that ^ matches at the start of every line rather than only at the start of the string. To extract just the host, you will have to use a capturing group.

Unfortunately, Python's native re library supports neither \K nor variable-width look-behind. Variable-width look-behind is, however, available in the external regex library.

Here is a sample code that you can use:

import re
# MULTILINE makes ^ match at the start of every line, not only at the start of the string.
p = re.compile(r'^(?:by|from) (\S+)', re.MULTILINE)
test_str = "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n\nby 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n\nfrom localhost (localhost [127.0.0.1])"
# group(1) holds only the host; the leading "by " or "from " is matched but not captured.
print([m.group(1) for m in p.finditer(test_str)])

Output of a demo program (Python 3):

['mail2.oknotify2.com', '10.66.156.198', 'localhost']
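
If a match without any capturing group is preferred, one possible workaround with the native re module is to alternate two look-behinds, since each alternative on its own has a fixed width. This is only a sketch under that assumption, written for Python 3; the names pattern and hosts are illustrative:

```python
import re

test_str = ("from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03\n\n"
            "by 10.66.156.198 with SMTP id wg6mr62843415pab.126.1433365924352;\n\n"
            "from localhost (localhost [127.0.0.1])")

# Each look-behind has a fixed width ("by " is 3 characters, "from " is 5),
# which the native re module accepts; ^ is zero-width and adds nothing to it.
pattern = re.compile(r'(?:(?<=^by )|(?<=^from ))\S+', re.MULTILINE)

# With no capturing group, findall returns the whole match, which is the host alone.
hosts = pattern.findall(test_str)
print(hosts)
```

Because ^ is required inside each look-behind, a "by " that appears later in a line (such as the one before mx.google.com) is not picked up.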

Upvotes: 1

James Newton

Reputation: 7092

I'm writing an answer because a comment does not allow for formatting, but the correct answer is given by @stribizhev.

@stribizhev proposed this regex:

^(?:by|from) (\S+)

The ?: at the beginning of (?:by|from) makes it a non-capturing group. (\S+) is a capturing group. If you use result = string.match(regex) (JavaScript syntax) and there is a match, then result will contain an array such as ["from mail2.oknotify2.com", "mail2.oknotify2.com"]. The value of result[0] is the whole match, and result[1] is the captured group.
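
Since the question is about Python, the same idea can be sketched with the re module, where the match object exposes the whole match as group(0) and the captured host as group(1). The variable names are illustrative:

```python
import re

line = "from mail2.oknotify2.com (mail2.oknotify2.com. [208.83.243.70]) by mx.google.com with ESMTP id dp5si2596299pdb.170.2015.06.03.14.12.03"

# (?:by|from) is matched but not captured; (\S+) is capturing group 1.
m = re.search(r'^(?:by|from) (\S+)', line)

if m:
    print(m.group(0))  # whole match, including the leading "from "
    print(m.group(1))  # captured group only
```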

Upvotes: 1
