Kitanotori

Reputation: 1821

Unable to parse a complex language with regex and Scala parser combinators

I'm trying to write a parser for a certain language as part of my research. Currently I'm having trouble getting the following code to work the way I want:

private def _uw: Parser[UW] = _headword ~ _modifiers ~ _attributes ^^ {
  case hw ~ mods ~ attrs => new UW(hw, mods, attrs)
}

private def _headword[String] = "\".*\"".r | "[^(),]*".r

private def _modifiers: Parser[List[UWModifier]] = opt("(" ~> repsep(_modifier, ",") <~ ")") ^^ {
  case Some(mods) => mods
  case None       => List[UWModifier]()
}

private def _modifier: Parser[UWModifier] = ("[^><]*".r ^^ (RelTypes.toRelType(_))) ~ "[><]".r ~ _uw ^^ {
  case (rel: RelType) ~ x ~ (uw: UW) => new UWModifier(rel, uw)
}

private def _attributes: Parser[List[UWAttribute]] = rep(_attribute) ^^ {
  case Nil   => List[UWAttribute]()
  case attrs => attrs
}

private def _attribute: Parser[UWAttribute] = ".@" ~> "[^>.]*".r ^^ (new UWAttribute(_))

The code above covers just one part of the language; to save time and space, I won't go into detail about the rest. The _uw method is supposed to parse a string consisting of three parts, of which only the first is required.

_uw should be able to parse these test strings correctly:

test0
test1.@attr
"test2"
"test3".@attr
test4..
test5..@attr
"test6..".@attr
"test7.@attr".@attr
test8(urel>uw)
test9(urel>uw).@attr
"test10..().@"(urel>uw).@attr
test11(urel1>uw1(urel2>uw2,urel3>uw3),urel4>uw4).@attr1.@attr2

So if the headword starts and ends with ", everything inside the double quotes is considered to be part of the headword. All words starting with .@, if they are not inside the double quotes, are attributes of the headword.

E.g. in test5, the parser should take test5. as the headword and attr as an attribute. Only the .@ separator is dropped; every dot before it belongs to the headword.

So, after headword there CAN be attributes and/or modifiers. The order is strict, so attributes always come after modifiers. If there are attributes but no modifiers, everything until .@ is considered as part of the headword.

The main problem is "[^@(]*".r. I've tried all kinds of creative alternatives, such as "(^[\\w\\.]*)((\\.\\@)|$)".r, but nothing seems to work. How do lookahead and lookbehind even affect parser combinators? I'm not an expert on parsing or regex, so all help is welcome!

Upvotes: 0

Views: 289

Answers (1)

Daniel C. Sobral

Reputation: 297205

I don't think "[^@(]*".r has anything to do with your problem. I see this:

private def _headword[String] = "\".*\"".r | "[^(),]*".r

which is the first thing in _uw (and, by the way, using underscores in names in Scala is not recommended), so when it tries to parse test5..@attr, the second regexp will match all of it!

scala> "[^(),]*".r findFirstIn "test5..@attr"
res0: Option[String] = Some(test5..@attr)

So there will be nothing left for the remaining parsers. The first regex in _headword is problematic too, because .* will accept quotes, which means that something like this becomes valid:

"test6 with a " inside of it..".@attr
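To see the greedy behaviour concretely, here is a minimal sketch that uses only scala.util.matching.Regex. The object name QuotedHeadwordDemo and the stricter pattern are illustrative assumptions, not part of the original code; findPrefixOf is used because it approximates how a regex is tried at the current input position.

```scala
import scala.util.matching.Regex

object QuotedHeadwordDemo {
  // Regex from the question: `.*` is greedy and also matches quote characters.
  val greedy: Regex = "\".*\"".r

  // Assumed alternative: no quote characters allowed inside the quoted headword.
  val strict: Regex = "\"[^\"]*\"".r

  // The input from the answer, with a stray quote in the middle.
  val input: String = "\"test6 with a \" inside of it..\".@attr"

  def main(args: Array[String]): Unit = {
    // Greedy version runs up to the LAST quote in the input.
    println(greedy.findPrefixOf(input))
    // Strict version stops at the first closing quote.
    println(strict.findPrefixOf(input))
  }
}
```

The greedy pattern consumes everything up to the last quote ("test6 with a " inside of it.."), while the stricter one stops at "test6 with a ". Whether quotes should be escapable inside a headword is a language design question that the sketch does not address.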

As for look-ahead and look-behind, they don't affect parser combinators at all. Either the regex matches, or it doesn't -- that's all the parser combinators care about.
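Since only the matched text matters, a look-ahead can still be used inside the headword regex to decide where the match stops. The following is a sketch under assumptions: the pattern and the names HeadwordLookaheadDemo and headwordOf are hypothetical, and findPrefixOf stands in for the way a regex parser is applied at the current position. It has not been checked against the full grammar.

```scala
import scala.util.matching.Regex

object HeadwordLookaheadDemo {
  // Hypothetical replacement for the unquoted headword regex:
  // consume any character except parentheses, commas and dots,
  // or a dot that is NOT followed by '@' (negative look-ahead).
  val headword: Regex = """(?:[^(),.]|\.(?!@))*""".r

  // Try the regex at the start of the input, as a regex parser would
  // at its current position.
  def headwordOf(input: String): Option[String] = headword.findPrefixOf(input)

  def main(args: Array[String]): Unit = {
    Seq("test0", "test1.@attr", "test4..", "test5..@attr", "test8(urel>uw)")
      .foreach(s => println(s"$s -> ${headwordOf(s)}"))
  }
}
```

With this pattern, test5..@attr yields the headword test5. and leaves .@attr for the attribute parser, and test8(urel>uw) yields test8, leaving the parenthesised modifiers untouched. The look-ahead itself consumes nothing; the combinators only ever see the prefix that was matched.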

Upvotes: 1
