CoderGuru

Reputation: 63

Why is my regex pattern collecting more than I am expecting?

I am trying to create a regex pattern to get the email address only after the word "Sender".

Below is example input:

Recip: [email protected]
Subject: Report results (Gd)
Headers: Received: from daem.com (unknown [127.1.1.1])
Date: Sat, 13 Feb 2021 13:11:42 +0000 (GMT)
From: Tavon Lo <[email protected]>
Recip: [email protected]
Subject: Report results (Gd1)
Headers: Received: from daem2.com (unknown [127.1.1.1])
Date: Sat, 14 Feb 2021 13:11:42 +0000 (GMT)
From: Tavon Lo <[email protected]>
Sender: [email protected]
Recipient: [email protected]

So, the only email address that should be in the output is [email protected]

Below is my regex pattern:

(?m)^Sender:([^<>@]+@[^<>]+)

This matches the following:

[email protected]
Recipient: [email protected]

See regex demo https://regex101.com/r/qRLrAW/1

I only want [email protected]. I am new to regex patterns, so this is probably an easy fix, but I have been stuck. Any ideas or suggestions on how to fix the regex pattern so that it matches only that address?

Upvotes: 1

Views: 61

Answers (4)

The fourth bird

Reputation: 163287

The catch here is that a negated character class such as [^<>] also matches newlines, so you have to exclude them by adding \r and \n to the class.

You can also turn the match into a positive lookbehind:

(?m)(?<=^Sender: )[^<>@\n\r]+@[^<>\r\n]+

Regex demo

If the email address also cannot contain spaces, you can use \s instead of \r\n:

(?m)(?<=^Sender: )[^<>@\s]+@[^<>\s]+

The pattern matches:

  • (?m) Inline modifier for multiline
  • (?<=^Sender: ) Positive lookbehind, assert that Sender: followed by a space is directly to the left, at the start of a line
  • [^<>@\s]+@[^<>\s]+ Match an email-like pattern excluding whitespace (which includes newlines)

Regex demo
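As a rough illustration, here is how the \s variant might be used from Python's built-in re module. The input and its addresses are made-up placeholders, not the ones from the question. Python's re only accepts fixed-width lookbehinds, which holds here because "Sender: " is always 8 characters.

```python
import re

# Placeholder input; the addresses are invented for illustration.
text = (
    "From: Tavon Lo <tavon@example.com>\n"
    "Sender: alice@example.com\n"
    "Recipient: bob@example.com\n"
)

# (?m) makes ^ match at the start of every line.
# The lookbehind asserts "Sender: " without including it in the match.
pattern = re.compile(r"(?m)(?<=^Sender: )[^<>@\s]+@[^<>\s]+")

# With no capture groups, findall returns the full matches.
matches = pattern.findall(text)
print(matches)
```

Because the lookbehind is not part of the match, no capture group is needed to get at the address.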

As another option, with the PyPI regex module you could also use \K to keep only the part of the match after Sender: .

Upvotes: 1

Sardar Badar Saghir

Reputation: 65

I think this expression will be useful here. The lookbehind drops the Sender: prefix from the match, and .+ selects the email part that follows it:

(?<=Sender: ).+
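A small sketch of this pattern in Python's re module, using made-up placeholder addresses. Since . does not match a newline by default, .+ stops at the end of the line, but it takes whatever follows Sender: without checking that it looks like an email address.

```python
import re

# Placeholder input; the addresses are invented for illustration.
text = "Sender: alice@example.com\nRecipient: bob@example.com\n"

# .+ runs to the end of the line because "." excludes "\n" by default.
rest_of_line = re.search(r"(?<=Sender: ).+", text).group(0)
print(rest_of_line)
```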

Upvotes: 0

Omar Si

Reputation: 154

It's because [^<>]+ matches \n as well, so it will go over the end of the line to the next line.

You need to add a \n to your negated character classes, as Wiktor Stribiżew did in his answer.
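To see the difference, the sketch below runs the pattern from the question next to the same pattern with \r and \n added to both classes. It uses Python's re module, and the addresses are invented placeholders rather than the ones from the question.

```python
import re

# Placeholder input; the addresses are invented for illustration.
text = "Sender: alice@example.com\nRecipient: bob@example.com\n"

# Original pattern: [^<>]+ also matches "\n", so it runs past the line end.
original = re.compile(r"(?m)^Sender:([^<>@]+@[^<>]+)")

# Same pattern with \r and \n added to both negated classes.
fixed = re.compile(r"(?m)^Sender:([^<>@\r\n]+@[^<>\r\n]+)")

overmatch = original.search(text).group(1)
single_line = fixed.search(text).group(1)

print(repr(overmatch))    # spills onto the Recipient line
print(repr(single_line))  # stays on the Sender line
```

Note that both groups still start with the space after the colon, since nothing in these two patterns consumes it before the group.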

Upvotes: 1

Wiktor Stribiżew

Reputation: 626794

You can use

(?m)^Sender:[^\S\r\n]*([^<>@\n\r]+@[^<>\n\r]+)

See the regex demo.

Details:

  • (?m)^ - start of a line
  • Sender: - a literal string
  • [^\S\r\n]* - zero or more whitespaces other than CR and LF
  • ([^<>@\n\r]+@[^<>\n\r]+) - Group 1: one or more chars other than <, >, @, CR and LF, then @, and then one or more chars other than <, >, CR and LF.
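For example, in Python's re module the address could be read from Group 1 roughly like this. The input text and addresses are hypothetical placeholders.

```python
import re

# Placeholder input; the addresses are invented for illustration.
text = (
    "Recip: carol@example.com\n"
    "Sender:   alice@example.com\n"
    "Recipient: bob@example.com\n"
)

# [^\S\r\n]* skips horizontal whitespace after the colon,
# so Group 1 starts at the address itself.
pattern = re.compile(r"(?m)^Sender:[^\S\r\n]*([^<>@\n\r]+@[^<>\n\r]+)")

m = pattern.search(text)
sender = m.group(1) if m else None
print(sender)
```

Compared with a lookbehind, this form tolerates any number of spaces or tabs after Sender:, at the cost of needing the capture group.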

Upvotes: 2
