esryl
esryl

Reputation: 189

Recursive processing of markup using Regular Expression and DOMDocument

I have been struggling to create a parser for a basic text to HTML markup language I am developing. Inline element markup is as follows.

{*strong*}
{/emphasis/}
{-strikethrough-}
{>small<}
{|code|}

A sample string I am testing against is:

tëstïng 汉字/漢字 testing {*strông{/ëmphäsïs{-strïkë{|côdë|}-}/}*} {*wôw*} 1, 2, 3

Using preg_split I can convert this to:

$split = preg_split('%(\{.(?:[^{}]+|(?R))+.\})%',
    $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

array (size=5)
  0 => string 'tëstïng 汉字/漢字 testing ' (length=32)
  1 => string '{*strông{/ëmphäsïs{-strïkë{|côdë|}-}/}*}' (length=48)
  2 => string ' ' (length=1)
  3 => string '{*wôw*}' (length=8)
  4 => string ' 1, 2, 3' (length=8)

Then loop through and $dom->createTextNode() or $dom->createElement() + $dom->appendChild($dom->createTextNode()). Unfortunately this does not help when markup is nested.

I am just stumped at an efficient way to recursively process my markup into DOMDocument. I keep reading that I need to write a parser but cannot find a suitable tutorial or code example which I can follow, especially when integrating it with element and text node creation using DOMDocument.

Upvotes: 2

Views: 271

Answers (2)

Francis Avila
Francis Avila

Reputation: 31651

Nested or recursive structures are typically beyond the power of regexes to parse and you usually need a more powerful parser. The problem is that the next token you need to find changes depending on previous tokens, which is not something a regular expression can handle (the language is no longer regular).

However for such a simple language you don't need a full fledged parser-generator with a formal grammar--you can easily write a simple parser by hand. You have only one bit of state that is important -- the last opened tag. If you have a regular expression that matches text, a new open tag, or the corresponding close tag to the currently open tag, you can handle this task. Rules are:

  1. If you match text, save the text and continue matching.
  2. If you match an open tag, save the open tag, and continue matching until you find an open tag or a corresponding close tag.
  3. If you match a close tag, stop looking for the currently open tag and continue matching for the last-unclosed tag, text, or another open tag.

Step two is recursive--whenever you find a new open tag, you create a new matching context that looks for the corresponding close tag.

This isn't required, but generally a parser will produce a simple tree structure to represent the parsed text--this is known as an abstract syntax tree. It's usually better to produce a syntax tree first before you produce what the syntax represents. This gives you flexibility to manipulate the tree or to produce different outputs (e.g., you could output something other than xml.)

Here is a solution that combines both these ideas and parses your text. (It also recognizes {{ or }} as escape sequences meaning a single literal { or }.)

First the parser:

class ParseError extends RuntimeException {}

function str_to_ast($s, $offset=0, $ast=array(), $opentag=null) {
    if ($opentag) {
        $qot = preg_quote($opentag, '%');
        $re_text_suppl = '[^{'.$qot.']|{{|'.$qot.'[^}]';
        $re_closetag = '|(?<closetag>'.$qot.'\})';
    } else {
        $re_text_suppl = '[^{]|{{';
        $re_closetag = '';
    }
    $re_next = '%
        (?:\{(?P<opentag>[^{\s]))  # match an open tag
              #which is "{" followed by anything other than whitespace or another "{"
        '.$re_closetag.'  # if we have an open tag, match the corresponding close tag, e.g. "-}"
        |(?P<text>(?:'.$re_text_suppl.')+) # match text
            # we allow non-matching close tags to act as text (no escape required)
            # you can change this to produce a parseError instead
        %ux';
    while ($offset < strlen($s)) {
        if (preg_match($re_next, $s, $m, PREG_OFFSET_CAPTURE, $offset)) {
            list($totalmatch, $offset) = $m[0];
            $offset += strlen($totalmatch);
            unset($totalmatch);
            if (isset($m['opentag']) && $m['opentag'][1] !== -1) {
                list($newopen, $_) = $m['opentag'];
                list($subast, $offset) = str_to_ast($s, $offset, array(), $newopen);
                $ast[] = array($newopen, $subast);
            } else if (isset($m['text']) && $m['text'][1] !== -1) {
                list($text, $_) = $m['text'];
                $ast[] = array(null, $text);
            } else if ($opentag && isset($m['closetag']) && $m['closetag'][1] !== -1) {
                return array($ast, $offset);
            } else {
                throw new ParseError("Bug in parser!");
            }
        } else {
            throw new ParseError("Could not parse past offset: $offset");
        }
    }
    return array($ast, $offset);
}

function parse($s) {
    list($ast, $offset) = str_to_ast($s);
    return $ast;
}

This will produce an abstract syntax tree which is a list of "nodes", where each node is an array of the form array(null, $string) for text or array('-', array(...)) (i.e. the type code and another list of nodes) for stuff inside tags.

Once you have this tree you can do anything you want with it. For example, we can traverse it recursively to produce a DOM tree:

function ast_to_dom($ast, DOMNode $n = null) {
    if ($n === null) {
        $dd = new DOMDocument('1.0', 'utf-8');
        $dd->xmlStandalone = true;
        $n = $dd->createDocumentFragment();
    } else {
        $dd = $n->ownerDocument;
    }
    // Map of type codes to element names
    $typemap = array(
        '*' => 'strong',
        '/' => 'em',
        '-' => 's',
        '>' => 'small',
        '|' => 'code',
    );

    foreach ($ast as $astnode) {
        list($type, $data) = $astnode;
        if ($type===null) {
            $n->appendChild($dd->createTextNode($data));
        } else {
            $n->appendChild(ast_to_dom($data, $dd->createElement($typemap[$type])));
        }
    }
    return $n;
}

function ast_to_doc($ast) {
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->xmlStandalone = true;
    $root = $doc->createElement('body');
    $doc->appendChild($root);
    ast_to_dom($ast, $root);
    return $doc;
}

Here is some test code with a more difficult test case:

$sample = "tëstïng 汉字/漢字 {{ testing -} {*strông 
    {/ëmphäsïs {-strïkë *}also strike-}/} also {|côdë|}
    strong *} {*wôw*} 1, 2, 3";
$ast = parse($sample);
echo ast_to_doc($ast)->saveXML();

This will print the following:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<body>tëstïng 汉字/漢字 {{ testing -} <strong>strông 
    <em>ëmphäsïs <s>strïkë *}also strike</s></em> also <code>côdë</code>
    strong </strong> <strong>wôw</strong> 1, 2, 3</body>

If you already have a DOMDocument and you want to add some parsed text to it, I recommend creating a DOMDocumentFragment and passing it to ast_to_dom directly, then appending this to your desired container element.

Upvotes: 5

Sepster
Sepster

Reputation: 4848

If you have a regex that captures the content of your outermost open/close pair, you could then wrap that captured content in your equivalent HTML tags, then recurse into that new string by repeating the same regex (which would the capture the content of the second-to-outermost pair), and so on.

The problem with this approach is that if/when an opening "tag" is not properly closed, the entire content is lost and then you can't recurse into it.

A more reliable approach might be to parse the text from beginning to end, and when you encounter an opening tag you add it and its position, onto a stack. Whenever a closing tag is encountered, it is ignored if it doesn't match the opening tag at the top of the stack, or if it does match then replace the current closing tag with the equivalent HTML closing tag and pop the opening tag from the stack (and replace it with the equivalent opening HTML tag at the recorded position).

A simple algorithm for parsing might be to find the first instance of your opening or closing tags ( eg using this regex (\{[-*/>|])|(\}[-*/<|]) ), then process as above, then repeat that search from the current position to find the next tag, etc...

Upvotes: 1

Related Questions