Hanno Fietz
Hanno Fietz

Reputation: 31360

Standard algorithm to tokenize a string, keep delimiters (in PHP)

I want to split an arithmetic expression into tokens, to convert it into RPN.

Java has the StringTokenizer, which can optionally keep the delimiters. That way, I could use the operators as delimiters. Unfortunately, I need to do this in PHP, which has strtok, but that throws away the delimiters, so I need to brew something myself.

This sounds like a classic textbook example for Compiler Design 101, but I'm afraid I'm lacking some formal education here. Is there a standard algorithm you can point me to?

My other options are to read up on Lexical Analysis or to roll up something quick and dirty with the available string functions.

Upvotes: 2

Views: 1661

Answers (3)

Hanno Fietz
Hanno Fietz

Reputation: 31360

OK, thanks to PhiLho, my final code is this, should anyone need it. It's not even really dirty. :-)

static function rgTokenize($s)
{
    $rg = array();

    // remove whitespace
    $s = preg_replace("/\s+/", '', $s);

    // split at numbers, identifiers, function names and operators
    $rg = preg_split('/([*\/^+\(\)-])|(#\d+)|([\d.]+)|(\w+)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // find right-associative '-' and put it as a sign onto the following number
    for ($ix = 0, $ixMax = count($rg); $ix < $ixMax; $ix++) {
        if ('-' == $rg[$ix]) {
            if (isset($rg[$ix - 1]) && self::fIsOperand($rg[$ix - 1])) {
                continue;
            } else if (isset($rg[$ix + 1]) && self::fIsOperand($rg[$ix + 1])) {
                $rg[$ix + 1] = $rg[$ix].$rg[$ix + 1];
                unset($rg[$ix]);
            } else {
                throw new Exception("Syntax error: Found right-associative '-' without operand");
            }
        }
    }
    $rg = array_values($rg);

    echo join(" ", $rg)."\n";

    return $rg;
}

Upvotes: 0

PhiLho
PhiLho

Reputation: 41132

As often, I would just use a regular expression to do this:

$expr = '(5*(7 + 2 * -9.3) - 8 )/ 11';
$tokens = preg_split('/([*\/^+-]+)\s*|([\d.]+)\s*/', $expr, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$tts = print_r($tokens, true);
echo "<pre>x=$tts</pre>";

It needs a little more work to accept numbers with exponent (like -9.2e-8).

Upvotes: 1

Shoan
Shoan

Reputation: 4078

This might help.

Practical Uses of Tokenizer

Upvotes: 2

Related Questions