Recursive processing of markup using Regular Expression and DOMDocument

Question

I have been struggling to create a parser for a basic text to HTML markup language I am developing. Inline element markup is as follows.

{*strong*}
{/emphasis/}
{-strikethrough-}
{>small<}
{|code|}

A sample string I am testing against is:

tëstïng 汉字/漢字 testing {*strông{/ëmphäsïs{-strïkë{|côdë|}-}/}*} {*wôw*} 1, 2, 3

Using preg_split I can convert this to:

$split = preg_split('%(\{.(?:[^{}]+|(?R))+.\})%',
    $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

array (size=5)
  0 => string 'tëstïng 汉字/漢字 testing ' (length=32)
  1 => string '{*strông{/ëmphäsïs{-strïkë{|côdë|}-}/}*}' (length=48)
  2 => string ' ' (length=1)
  3 => string '{*wôw*}' (length=8)
  4 => string ' 1, 2, 3' (length=8)

Then loop through and $dom->createTextNode() or $dom->createElement() + $dom->appendChild($dom->createTextNode()). Unfortunately this does not help when markup is nested.

I am just stumped at an efficient way to recursively process my markup into DOMDocument. I keep reading that I need to write a parser but cannot find a suitable tutorial or code example which I can follow, especially when integrating it with element and text node creation using DOMDocument.

Francis Avila · Accepted Answer

Nested or recursive structures are typically beyond the power of regexes to parse and you usually need a more powerful parser. The problem is that the next token you need to find changes depending on previous tokens, which is not something a regular expression can handle (the language is no longer regular).

However for such a simple language you don't need a full fledged parser-generator with a formal grammar--you can easily write a simple parser by hand. You have only one bit of state that is important -- the last opened tag. If you have a regular expression that matches text, a new open tag, or the corresponding close tag to the currently open tag, you can handle this task. Rules are:

If you match text, save the text and continue matching.
If you match an open tag, save the open tag, and continue matching until you find an open tag or a corresponding close tag.
If you match a close tag, stop looking for the currently open tag and continue matching for the last-unclosed tag, text, or another open tag.

Step two is recursive--whenever you find a new open tag, you create a new matching context that looks for the corresponding close tag.

This isn't required, but generally a parser will produce a simple tree structure to represent the parsed text--this is known as an abstract syntax tree. It's usually better to produce a syntax tree first before you produce what the syntax represents. This gives you flexibility to manipulate the tree or to produce different outputs (e.g., you could output something other than xml.)

Here is a solution that combines both these ideas and parses your text. (It also recognizes {{ or }} as escape sequences meaning a single literal { or }.)

First the parser:

class ParseError extends RuntimeException {}

function str_to_ast($s, $offset=0, $ast=array(), $opentag=null) {
    if ($opentag) {
        $qot = preg_quote($opentag, '%');
        $re_text_suppl = '[^{'.$qot.']|{{|'.$qot.'[^}]';
        $re_closetag = '|(?'.$qot.'\})';
    } else {
        $re_text_suppl = '[^{]|{{';
        $re_closetag = '';
    }
    $re_next = '%
        (?:\{(?P[^{\s]))  # match an open tag
              #which is "{" followed by anything other than whitespace or another "{"
        '.$re_closetag.'  # if we have an open tag, match the corresponding close tag, e.g. "-}"
        |(?P(?:'.$re_text_suppl.')+) # match text
            # we allow non-matching close tags to act as text (no escape required)
            # you can change this to produce a parseError instead
        %ux';
    while ($offset < strlen($s)) {
        if (preg_match($re_next, $s, $m, PREG_OFFSET_CAPTURE, $offset)) {
            list($totalmatch, $offset) = $m[0];
            $offset += strlen($totalmatch);
            unset($totalmatch);
            if (isset($m['opentag']) && $m['opentag'][1] !== -1) {
                list($newopen, $_) = $m['opentag'];
                list($subast, $offset) = str_to_ast($s, $offset, array(), $newopen);
                $ast[] = array($newopen, $subast);
            } else if (isset($m['text']) && $m['text'][1] !== -1) {
                list($text, $_) = $m['text'];
                $ast[] = array(null, $text);
            } else if ($opentag && isset($m['closetag']) && $m['closetag'][1] !== -1) {
                return array($ast, $offset);
            } else {
                throw new ParseError("Bug in parser!");
            }
        } else {
            throw new ParseError("Could not parse past offset: $offset");
        }
    }
    return array($ast, $offset);
}

function parse($s) {
    list($ast, $offset) = str_to_ast($s);
    return $ast;
}

This will produce an abstract syntax tree which is a list of "nodes", where each node is an array of the form array(null, $string) for text or array('-', array(...)) (i.e. the type code and another list of nodes) for stuff inside tags.

Once you have this tree you can do anything you want with it. For example, we can traverse it recursively to produce a DOM tree:

function ast_to_dom($ast, DOMNode $n = null) {
    if ($n === null) {
        $dd = new DOMDocument('1.0', 'utf-8');
        $dd->xmlStandalone = true;
        $n = $dd->createDocumentFragment();
    } else {
        $dd = $n->ownerDocument;
    }
    // Map of type codes to element names
    $typemap = array(
        '*' => 'strong',
        '/' => 'em',
        '-' => 's',
        '>' => 'small',
        '|' => 'code',
    );

    foreach ($ast as $astnode) {
        list($type, $data) = $astnode;
        if ($type===null) {
            $n->appendChild($dd->createTextNode($data));
        } else {
            $n->appendChild(ast_to_dom($data, $dd->createElement($typemap[$type])));
        }
    }
    return $n;
}

function ast_to_doc($ast) {
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->xmlStandalone = true;
    $root = $doc->createElement('body');
    $doc->appendChild($root);
    ast_to_dom($ast, $root);
    return $doc;
}

Here is some test code with a more difficult test case:

$sample = "tëstïng 汉字/漢字 {{ testing -} {*strông 
    {/ëmphäsïs {-strïkë *}also strike-}/} also {|côdë|}
    strong *} {*wôw*} 1, 2, 3";
$ast = parse($sample);
echo ast_to_doc($ast)->saveXML();

This will print the following:


tëstïng 汉字/漢字 {{ testing -} strông 
    ëmphäsïs strïkë *}also strike also côdë
    strong  wôw 1, 2, 3

If you already have a DOMDocument and you want to add some parsed text to it, I recommend creating a DOMDocumentFragment and passing it to ast_to_dom directly, then appending this to your desired container element.

Recursive processing of markup using Regular Expression and DOMDocument

Answers (2)

Related Questions