Brian Hsu

Reputation: 8821

How to restrict nested markup with regex and parser combinators?

I would like to implement a simple wiki-like markup parser as an exercise in using Scala parser combinators.

I would like to solve this bit by bit, so here is what I would like to achieve in the first version: a simple inline literal markup.

For example, if the input string is:

This is a sytax test ``code here`` . Hello ``World``

The output string should be:

This is a sytax test <code>code here</code> . Hello <code>World</code>

I tried to solve this using RegexParsers, and here is what I have so far:

import scala.util.parsing.combinator._
import scala.util.parsing.input._

object TestParser extends RegexParsers
{   
    override val skipWhitespace = false

    def toHTML(s: String) = "<code>" + s.drop(2).dropRight(2) + "</code>"

    val words = """(.)""".r
    val literal = """\B``(.)*``\B""".r ^^ toHTML

    val markup = (literal | words)*

    def run(s: String) = parseAll(markup, s) match {
        case Success(xs, next) => xs.mkString
        case _ => "fail"
    }
}

println (TestParser.run("This is a sytax test ``code here`` . Hello ``World``"))

With this code, a simpler input that contains only one <code> markup works fine. For example:

This is a sytax test ``code here``.

becomes

This is a sytax test <code>code here</code>.

But when I run it with the example above, it yields

This is a sytax test <code>code here`` . Hello ``World</code>

I think this is because the regex I use:

"""\B``(.)*``\B""".r

allows any character, including `` itself, between the `` pairs.

How should I restrict the pattern so that `` cannot appear inside a literal, and so fix this problem?

Upvotes: 0

Views: 156

Answers (3)

xaxxon

Reputation: 19761

Here are some docs on greedy versus non-greedy matching:

http://www.exampledepot.com/egs/java.util.regex/Greedy.html

Basically, the match starts at the first `` and extends as far as it can while still matching, so it ends at the `` after World.

By putting a ? after your *, you tell it to do the shortest match possible, instead of the longest match.

Another option is to use [^`]* (anything EXCEPT `), and that will force it to stop earlier.
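
To make the difference concrete, here is a minimal sketch that compares the three patterns using only scala.util.matching.Regex. The object name GreedyDemo and the helper captures are made up for illustration, and the \B anchors from the question are left out to keep the patterns short.

```scala
import scala.util.matching.Regex

// Hypothetical demo object, not part of the original question.
object GreedyDemo {
  val input = "This is a sytax test ``code here`` . Hello ``World``"

  // Greedy: .* grabs as much as possible, then backtracks to the LAST ``
  val greedy: Regex = """``(.*)``""".r
  // Reluctant (non-greedy): .*? stops at the FIRST closing ``
  val reluctant: Regex = """``(.*?)``""".r
  // Negated class: the body may not contain a backtick at all
  val negated: Regex = """``([^`]*)``""".r

  // Collect the text captured between the `` pairs for every match.
  def captures(re: Regex): List[String] =
    re.findAllMatchIn(input).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    println(captures(greedy))    // List(code here`` . Hello ``World)
    println(captures(reluctant)) // List(code here, World)
    println(captures(negated))   // List(code here, World)
  }
}
```

The two fixes are not quite equivalent: the negated class rejects a single ` inside the literal, while the reluctant version allows it.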

Upvotes: 2

Luigi Plinge

Reputation: 51109

I don't know much about regex parsers, but you can use a simple 1-liner:

def addTags(s: String) =
  """(``.*?``)""".r replaceAllIn (
                    s, m => "<code>" + m.group(0).replace("``", "") + "</code>")

Test:

scala> addTags("This is a sytax test ``code here`` . Hello ``World``")
res0: String = This is a sytax test <code>code here</code> . Hello <code>World</code>

Upvotes: 0

Brian Hsu

Reputation: 8821

After some trial and error, I found that the following regex seems to work:

"""``(.)*?``"""
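
For reference, here is a sketch of how the parser from the question might look with the reluctant quantifier in place. It assumes the scala-parser-combinators library is on the classpath (on Scala 2.11 and later it is a separate module). The object name FixedParser is made up, and rep(...) is used instead of the postfix * so that no postfixOps import is needed.

```scala
import scala.util.parsing.combinator._

// Hypothetical name; same structure as TestParser in the question.
object FixedParser extends RegexParsers {
  // Keep whitespace: it is part of the text we want to reproduce.
  override val skipWhitespace = false

  // Strip the surrounding `` pairs and wrap the body in <code> tags.
  def toHTML(s: String): String = "<code>" + s.drop(2).dropRight(2) + "</code>"

  // Any single character (except a line terminator), passed through unchanged.
  val words: Parser[String] = """.""".r
  // Reluctant .*? stops at the first closing `` instead of the last one.
  val literal: Parser[String] = """``.*?``""".r ^^ toHTML

  // Try a literal first, then fall back to a plain character.
  val markup: Parser[List[String]] = rep(literal | words)

  def run(s: String): String = parseAll(markup, s) match {
    case Success(xs, _) => xs.mkString
    case _              => "fail"
  }
}
```

With the input from the question, FixedParser.run should give one <code> element per `` pair rather than a single element spanning both.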

Upvotes: 0
