Reputation: 1821
I'm trying to write a parser for a certain language as part of my research. Currently I have problems getting the following code to work in a way I want:
private def _uw: Parser[UW] = _headword ~ _modifiers ~ _attributes ^^ {
case hw ~ mods ~ attrs => new UW(hw, mods, attrs)
}
private def _headword[String] = "\".*\"".r | "[^(),]*".r
private def _modifiers: Parser[List[UWModifier]] = opt("(" ~> repsep(_modifier, ",") <~ ")") ^^ {
case Some(mods) => mods
case None => List[UWModifier]()
}
private def _modifier: Parser[UWModifier] = ("[^><]*".r ^^ (RelTypes.toRelType(_))) ~ "[><]".r ~ _uw ^^ {
case (rel: RelType) ~ x ~ (uw: UW) => new UWModifier(rel, uw)
}
private def _attributes: Parser[List[UWAttribute]] = rep(_attribute) ^^ {
case Nil => List[UWAttribute]()
case attrs => attrs
}
private def _attribute: Parser[UWAttribute] = ".@" ~> "[^>.]*".r ^^ (new UWAttribute(_))
The above code contains just one part of the language, and to spare time and space, I won't go much into details about the whole language. _uw method is supposed to parse a string that consists of three parts, although just the first part must exist in the string.
_uw should be able to parse these test strings correctly:
test0
test1.@attr
"test2"
"test3".@attr
test4..
test5..@attr
"test6..".@attr
"test7.@attr".@attr
test8(urel>uw)
test9(urel>uw).@attr
"test10..().@"(urel>uw).@attr
test11(urel1>uw1(urel2>uw2,urel3>uw3),urel4>uw4).@attr1.@attr2
So if the headword starts and ends with "
, everything inside the double quotes is considered to be part of the headword. All words starting with .@
, if they are not inside the double quotes, are attributes of the headword.
E.g. in test5, the parser should parse test5.
as headword, and attr
as an attribute. Just .@ is omitted, and all dots before that should be contained in the headword.
So, after headword there CAN be attributes and/or modifiers. The order is strict, so attributes always come after modifiers. If there are attributes but no modifiers, everything until .@
is considered as part of the headword.
The main problem is "[^@(]*".r
. I've tried all kind of creative alternatives, such as "(^[\\w\\.]*)((\\.\\@)|$)".r
, but nothing seems to work. How does lookahead or lookbehind even affect parser combinators? I'm not an expert on parsing or regex, so all help is welcome!
Upvotes: 0
Views: 289
Reputation: 297205
I don't think "[^@(]*".r
has anything to do with your problem. I see this:
private def _headword[String] = "\".*\"".r | "[^(),]*".r
which is the first thing in _uw
(and, by the way, using underscores in names in Scala is not recommended), so when it tries to parse test5..@attr
, the second regexp will match all of it!
scala> "[^(),]*".r findFirstIn "test5..@attr"
res0: Option[String] = Some(test5..@attr)
So there will be nothing left for the remaining parsers. Also, the first regex in _headword
is also problematic, because .*
will accept quotes, which means that something like this becomes valid:
"test6 with a " inside of it..".@attr
As for look-ahead and look-behind, it doesn't affect parser combinators at all. Either the regex matches, or it doesn't -- that's all the parser combinators care about.
Upvotes: 1