A Saraf

Reputation: 275

scala - extract substring based on special characters

I need to extract the first, middle and last name from a string, based on special characters.

First name condition: name_str contains a comma (",") and ends with a space, any single character, and a period ("."). The first name is the text between the comma and that trailing initial.

For example, for name_str = SMITH, ANNE MARIE J. the first name is ANNE MARIE.

Middle name condition: name_str contains a comma (",") and ends with a space, any single character, and a period ("."). The middle name is the single character before the "." plus the period itself, i.e. everything after the last space.

For example, for name_str = SMITH, ANNE MARIE J. the middle name is J.

Last name: SMITH

I tried the code below to get the first name, but I still need to add a condition that checks whether name_str ends with space + any character + period ("."):

if (",.".forall(name_str.contains(_)))
  name_str.substring(name_str.indexOf(",") + 1, name_str.indexOf(" ")).trim

Upvotes: 1

Views: 625

Answers (2)

jwvh

Reputation: 51271

You could create a simple regex for each of the name formats you expect to parse.

val nameRE1 = "([^,]+),(.+) (.\\.)".r
val nameRE2 = "([^,]+),(.+)".r
val nameRE3 = "(.+) (.\\.) (.+)".r
val nameRE4 = "([^,]+) (.+)".r

List( "SMITH, ANNE MARIE J."
    , "Michael J. Fox"
    , "Van Halen, Eddie"
    , "Jo Blow"
    ).map{
  case nameRE1(ln, fn, mi) => List(fn.strip, mi, ln.strip)
  case nameRE2(ln, fn)     => List(fn.strip, "", ln.strip)
  case nameRE3(fn, mi, ln) => List(fn.strip, mi, ln.strip)
  case nameRE4(fn, ln)     => List(fn.strip, "", ln.strip)
  case nameX               => List(nameX)
}
//res0: List[List[String]] = List(List(ANNE MARIE, J., SMITH)
//                              , List(Michael, J., Fox)
//                              , List(Eddie, "", Van Halen)
//                              , List(Jo, "", Blow))
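
As a rough sketch, the same four patterns could be wrapped in a function that returns a typed result instead of a List, with None for input that fits none of the formats. ParsedName and NameParser are hypothetical names used only for illustration, and trim replaces strip only so the sketch does not assume Java 11 or later.

```scala
// Hypothetical result type; not part of the original answer.
final case class ParsedName(first: String, middle: String, last: String)

// Hypothetical wrapper around the four regexes shown above.
object NameParser {
  private val LastFirstInitial = "([^,]+),(.+) (.\\.)".r // "SMITH, ANNE MARIE J."
  private val LastFirst        = "([^,]+),(.+)".r        // "Van Halen, Eddie"
  private val FirstInitialLast = "(.+) (.\\.) (.+)".r    // "Michael J. Fox"
  private val FirstLast        = "([^,]+) (.+)".r        // "Jo Blow"

  // Patterns are tried top to bottom, most specific first.
  def parse(name: String): Option[ParsedName] = name.trim match {
    case LastFirstInitial(ln, fn, mi) => Some(ParsedName(fn.trim, mi, ln.trim))
    case LastFirst(ln, fn)            => Some(ParsedName(fn.trim, "", ln.trim))
    case FirstInitialLast(fn, mi, ln) => Some(ParsedName(fn.trim, mi, ln.trim))
    case FirstLast(fn, ln)            => Some(ParsedName(fn.trim, "", ln.trim))
    case _                            => None
  }
}
```

With this, NameParser.parse("SMITH, ANNE MARIE J.") gives Some(ParsedName(ANNE MARIE,J.,SMITH)), and a single-word input such as "Cher" gives None.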

Upvotes: 1

The fourth bird

Reputation: 163207

Matching names can be really difficult. For the format described in your question, you might use a broad pattern that approximates it, since names can contain many different characters.

It matches the last-name part before the comma, the first-name part after the comma, and the single-character-plus-dot pattern at the end.

^([^\s,][^,]*),\h*([^\s,].*?)\h+([^\s.]\.(?:[^\s.]\.)*)$
  • ^ Start of string
  • ( Capture group 1
    • [^\s,][^,]* Match a single non whitespace char except for a comma, followed by optional chars other than a comma
  • ) Close group 1
  • ,\h* Match a comma and optional horizontal whitespace chars
  • ( Capture group 2
    • [^\s,].*? Match a single non whitespace char except for a comma, followed by as few chars as possible
  • ) Close group 2
  • \h+ Match 1+ horizontal whitespace chars
  • ( Capture group 3
    • [^\s.]\. Match a single non whitespace char except for a dot, then match a dot
    • (?:[^\s.]\.)* Optionally repeat the same in case of multiple single characters followed by a dot
  • ) Close group 3
  • $ End of string

See a regex demo or a Scala demo

val s = "SMITH, ANNE MARIE J."
val regex =
  """^([^\s,][^,]*),\h*([^\s,].*?)\h+([^\s.]\.(?:[^\s.]\.)*)$"""
    .r("lastname", "firstname", "middlename")

regex.findFirstMatchIn(s) match {
  case Some(m) => println(
    s"Lastname: ${m.group("lastname")}, " +
      s"Firstname: ${m.group("firstname")}, " +
      s"Middlename: ${m.group("middlename")}"
  )
  case None => println("No match.")
}

Output

Lastname: SMITH, Firstname: ANNE MARIE, Middlename: J.
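
If you would rather not rely on group names, a minimal sketch of the same pattern used as an extractor follows. CommaNameExtractor and split are hypothetical names. Scala's Regex extractor binds capture groups by position, so this variant does not need the .r(groupNames) overload, which I believe is deprecated in recent 2.13 releases.

```scala
// Hypothetical helper; only the pattern itself comes from the answer above.
object CommaNameExtractor {
  // Groups by position: 1 = last name, 2 = first name(s), 3 = trailing initial(s).
  private val CommaName =
    """^([^\s,][^,]*),\h*([^\s,].*?)\h+([^\s.]\.(?:[^\s.]\.)*)$""".r

  // Returns (first, middle, last) when the input has the
  // "LAST, FIRST X." shape, otherwise None.
  def split(name: String): Option[(String, String, String)] = name match {
    case CommaName(last, first, middle) => Some((first, middle, last))
    case _                              => None
  }
}
```

For example, CommaNameExtractor.split("DOE, JOHN J.R.") yields Some((JOHN,J.R.,DOE)) because group 3 allows several initials in a row, while "Van Halen, Eddie" yields None since it has no trailing initial.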

Upvotes: 1
