user4035

Reputation: 23729

Regex doesn't work when newline is at the end of the string

Exercise: given a string with a name, then a space or newline, then an email, then maybe a newline and some text separated by newlines, capture the name and the domain of the email.

So I created the following:

val regexp = "^([a-zA-Z]+)(?:\\s|\\n)\\w+@(\\w+\\.\\w+)(?:.|\\r|\\n)*".r

def fun(str: String): String = {
  val result = str match {
    case regexp(name, domain) => name + ' ' + domain
    case _ => "invalid"
  }
  result
}

And started testing:

scala> val input = "oleg [email protected]"
scala> fun(input)
res17: String = oleg email.com
scala> val input = "oleg\[email protected]"
scala> fun(input)
res18: String = oleg email.com
scala> val input = """oleg
     | [email protected]
     | 7bdaf0a1be3"""

scala> fun(input)
res19: String = oleg email.com
scala> val input = """oleg
     | [email protected]
     | 7bdaf0a1be3
     | """

scala> fun(input)
res20: String = invalid

Why doesn't the regexp capture the string with the newline at the end?

Upvotes: 0

Views: 211

Answers (1)

The fourth bird

Reputation: 163362

This part, (?:\\s|\\n), can be shortened to \s, because \s also matches a newline. As there is still a space before the email in your multi-line inputs, use \s+ to match one or more whitespace characters.

Matching any character with (?:.|\\r|\\n)* is very inefficient due to the alternation. Use either [\S\s]* or the inline modifier (?s), which makes the dot match newlines as well.
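As a rough illustration of the difference, the sketch below compares a plain dot, the (?s) modifier, and the [\S\s] class on a string with newlines. The object name DotAllDemo, the helper matchesFully, and the sample string are made up for this example.

```scala
import scala.util.matching.Regex

object DotAllDemo {
  // Hypothetical sample with an embedded and a trailing newline.
  val sample = "abc\ndef\n"

  // Without flags, `.` does not match line terminators,
  // so `.*` cannot run past the first newline.
  val plainDot: Regex = "abc.*".r

  // (?s) enables DOTALL mode: `.` now matches line terminators too.
  val dotAll: Regex = "(?s)abc.*".r

  // [\S\s] matches any character without needing a flag.
  val charClass: Regex = """abc[\S\s]*""".r

  // True only if the regex matches the entire input.
  def matchesFully(re: Regex, s: String): Boolean =
    re.pattern.matcher(s).matches()
}
```

With this sample, only the plain-dot variant should fail to match the whole string.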

But if you only want the name and the domain of the email, the pattern does not have to match what comes after them, because only the 2 capturing groups are used in the output.

^([a-zA-Z]+)\s+\w+@(\w+\.\w+)

Regex demo
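One caveat when using this shorter pattern from Scala: a case regexp(...) pattern match succeeds only when the regex matches the entire input, so a search-style call is needed instead. Below is a minimal sketch using findFirstMatchIn; the object name ShortPatternDemo, the method name extract, and the address user@email.com are placeholders, not part of the question.

```scala
object ShortPatternDemo {
  // The shorter pattern: nothing after the domain is matched.
  val regexp = """^([a-zA-Z]+)\s+\w+@(\w+\.\w+)""".r

  // findFirstMatchIn searches for a match inside the input
  // instead of requiring the whole input to match.
  def extract(str: String): String =
    regexp.findFirstMatchIn(str) match {
      case Some(m) => m.group(1) + " " + m.group(2)
      case None    => "invalid"
    }
}
```

Calling regexp.unanchored and keeping the original match expression should work as an alternative.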

If you do want to match all that follows, you can use:

val regexp = """(?s)^([a-zA-Z]+)\s+\w+@(\w+\.\w+).*""".r

def fun(str: String): String = {
  val result = str match {
    case regexp(name, domain) => name + ' ' + domain
    case _ => "invalid"
  }
  result
}

Scala demo

Note that the pattern \w+@(\w+\.\w+) is very limited for matching email addresses.
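To make that limitation concrete, here is a small sketch that runs the pattern on two address shapes it does not handle well. The addresses and the object name EmailLimitsDemo are invented for illustration.

```scala
object EmailLimitsDemo {
  val regexp = """(?s)^([a-zA-Z]+)\s+\w+@(\w+\.\w+).*""".r

  def fun(str: String): String = str match {
    case regexp(name, domain) => name + " " + domain
    case _                    => "invalid"
  }

  // A dot in the local part: \w+ stops at "first" and the next
  // character is not "@", so the whole match fails.
  val dottedLocal: String = fun("oleg first.last@email.com")

  // A domain with more than two labels: the group captures only
  // "mail.example" and the trailing ".org" is consumed by `.*`.
  val multiLabel: String = fun("oleg user@mail.example.org")
}
```

Here dottedLocal evaluates to "invalid" and multiLabel to "oleg mail.example", so addresses like these would need a broader pattern.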

Upvotes: 2
