NemraOx
NemraOx

Reputation: 123

Scala: How to parse a string and generate a vector of nested vectors

Objective:

Given a string of the form val strang = "ab(cde(fgh)ijk)lmn)opq)rs"

First I want to construct a vector of the string split by the "(" and ")" characters. This is accomplished with,

val regexPattern = "((?<=\\()|(?=\\())|((?<=\\))|(?=\\)))"
val data: Vector[String] = strang .split(regexPattern ).map(_.toString).toVector

Note that the actual string will be much more varied, but how it's split stands.

Secondly and this is the hard part that I need help with. I want to iterate over the data vector and construct a nested vector with the following logic (below is psudo-codish)

val newVector =
for(row <- data) {
    if(row == "(") // opening angle bracket
        start constructing a nested vector inside of newVector
        keep appending to the nested vector until we encounter a ")"
        character (closing parenthesis).
        IF before encountering a ")" we encounter another "(" then
        create another nestedVector inside the previous nested vector
        and keep appending to it until a closing parenthesis ")" is
        encountered

    else if(row == ")")
        if we encounter a closing parenthesis, go up one level of the nested 
        vectors
    else
        simply append this row to the new vector
        newVector :+ row
}

I have made quite a few attempts at this with limited success. Using a recursive function I manage just one level of nesting but to go beyond that I end up using while loops which seems like overkill. The nested vectors can be of a special type. I have tried for example case class Rowy(row: String, rows: Vector[Vector[String]]) or case class Rowy(row: String, rows: Vector(Rowy))

I'm relatively new to scala and am wondering if there is an elegant way to approach this using scanLeft or foldLeft methods. Any help would be appreciated.

PS: I don't use an actual for loop to iterate over the "data" vector but rather a recursive function. The for loop is just to convey that an iteration is taking place. Any help would be appreciated.

Upvotes: 1

Views: 886

Answers (1)

Vladimir Matveev
Vladimir Matveev

Reputation: 127721

In your example, parentheses in the string are not balanced, and it is not really clear from your algorithm definition how it should be handled. If we assume that the input is actually well-balanced, then your native data structure would be a tree, with branches being one level of parentheses and leaves being the "words". Then the algorithm you need is just a simple stack-based tree construction. In the following example I use an explicit stack, but you can rewrite the algorithm to be recursive, and then the stack will be implicit.

// Tree structure definition
sealed trait Data
object Data {
  case class Leaf(data: String) extends Data
  case class Branch(parts: Vector[Data]) extends Data {
    // a convenience method to append data to a branch more easily
    def :+(data: Data): Branch = copy(parts = parts :+ data)
  }
}

object Main extends App {
  val regexPattern = "((?<=\\()|(?=\\())|((?<=\\))|(?=\\)))"
  val string = "ab(cde(fgh)ijk)lmn"  // now the input is balanced
  val data: Vector[String] = string.split(regexPattern).toVector

  println(data)

  var stack: List[Data.Branch] = Data.Branch(Vector.empty) :: Nil
  for (part <- data) {
    part match {
      case "(" =>
        stack = Data.Branch(Vector.empty) :: stack
      case ")" =>
        val top :: parent :: rest = stack
        stack = (parent :+ top) :: rest
      case _ =>
        stack = (stack.head :+ Data.Leaf(part)) :: stack.tail
    }
  }
  val result = stack.head

  println(result)
}

It outputs this:

Vector(ab, (, cde, (, fgh, ), ijk, ), lmn)
Branch(Vector(Leaf(ab), Branch(Vector(Leaf(cde), Branch(Vector(Leaf(fgh))), Leaf(ijk))), Leaf(lmn)))

If we reformat that a bit, we get

Branch(Vector(
  Leaf(ab),
  Branch(Vector(
    Leaf(cde),
    Branch(Vector(
      Leaf(fgh)
    )),
    Leaf(ijk)
  )),
  Leaf(lmn)
))

which, as you can see, mirrors the structure inside your string.

If your input could have unbalanced parentheses, then you might need to add additional checks to the loop in the case ")" part. Right now, if the input is not balanced, a MatchError would be thrown because the stack variable would not have the required number of elements.

Upvotes: 1

Related Questions