Vincenz Dreger

Reputation: 53

PHP: String to multidimensional array

(Sorry for my bad English)

I have a string that I want to split into an array. The square brackets denote nested arrays, which can be nested to any depth. Escaped characters should be preserved.

This is a sample string:

$string = '[[["Hello, \"how\" are you?","Good!",,,123]],,"ok"]';

The result structure should look like this:

array (
  0 => 
  array (
    0 => 
    array (
      0 => 'Hello, \"how\" are you?',
      1 => 'Good!',
      2 => '',
      3 => '',
      4 => '123',
    ),
  ),
  1 => '',
  2 => 'ok',
)

I have tested it with:

$pattern = '/[^"\\]*(?:\\.[^"\\]*)*/s';
$return = preg_match_all($pattern, $string, null);

But this did not work properly. I do not understand these regex patterns (I found this one in another example on this site), and I do not know whether preg_match_all is the right function to use.

I hope someone can help me.

Many Thanks!!!

Upvotes: 5

Views: 145

Answers (3)

Jan

Reputation: 43189

You might want to use a lexer in combination with a recursive function that actually builds the structure.

For your purpose, the following tokens have been used:

\[           # opening bracket
\]           # closing bracket
".+?(?<!\\)" # " to ", making sure it's not escaped
,(?!,)       # a comma, not followed by a comma
\d+          # at least one digit
,(?=,)       # a comma followed by a comma
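
To see how the two comma patterns divide the work, a small check like the following may help. It is only a sketch: the helper name classifyComma is made up for illustration and is not part of the lexer in this answer, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: reports which comma token matches at the start of $s.
function classifyComma(string $s): ?string {
    if (preg_match('~^,(?!,)~', $s)) {
        return 'T_SEPARATOR'; // comma not followed by another comma
    }
    if (preg_match('~^,(?=,)~', $s)) {
        return 'T_EMPTY';     // comma followed by a comma: an empty slot
    }
    return null;              // input does not start with a comma
}

// Walk through ",,,123" one character at a time and collect the comma tokens.
$input = ',,,123';
$kinds = array();
for ($i = 0; $i < strlen($input); $i++) {
    $kind = classifyComma(substr($input, $i));
    if ($kind !== null) {
        $kinds[] = $kind;
    }
}
print_r($kinds); // T_EMPTY, T_EMPTY, T_SEPARATOR
```

The two T_EMPTY tokens correspond to the two empty elements between "Good!" and 123 in the sample string; the last comma is an ordinary separator.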

The rest is programming logic, see a demo on ideone.com. Inspired by this post.


<?php
class Lexer {
    protected static $_terminals = array(
        '~^(\[)~'               => "T_OPEN",
        '~^(\])~'               => "T_CLOSE",
        '~^(".+?(?<!\\\\)")~'   => "T_ITEM",
        '~^(,)(?!,)~'           => "T_SEPARATOR",
        '~^(\d+)~'              => "T_NUMBER",
        '~^(,)(?=,)~'           => "T_EMPTY"
    );

    public static function run($line) {
        $tokens = array();
        $offset = 0;
        while($offset < strlen($line)) {
            $result = static::_match($line, $offset);
            if($result === false) {
                throw new Exception("Unable to parse input at offset " . $offset . ".");
            }
            $tokens[] = $result;
            $offset += strlen($result['match']);
        }
        return static::_generate($tokens);
    }

    protected static function _match($line, $offset) {
        $string = substr($line, $offset);

        foreach(static::$_terminals as $pattern => $name) {
            if(preg_match($pattern, $string, $matches)) {
                return array(
                    'match' => $matches[1],
                    'token' => $name
                );
            }
        }
        return false;
    }

    // a recursive function to actually build the structure
    protected static function _generate($arr=array(), $idx=0) {
        $output = array();
        $current = 0;
        for($i=$idx;$i<count($arr);$i++) {
            $type = $arr[$i]["token"];
            $element = $arr[$i]["match"];
            switch ($type) {
                case 'T_OPEN':
                    list($out, $index) = static::_generate($arr, $i+1);
                    $output[] = $out;
                    $i = $index;
                    break;
                case 'T_CLOSE':
                    return array($output, $i);
                    break;
                case 'T_ITEM':
                case 'T_NUMBER':
                    $output[] = $element;
                    break;
                case 'T_EMPTY':
                    $output[] = "";
                    break;
            }
        }
        return $output;
    }    
}

$input  = '[[["Hello, \"how\" are you?","Good!",,,123]],,"ok"]';
$items = Lexer::run($input);
print_r($items);

?>
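
Two details of the result are worth knowing: each T_ITEM match still carries its surrounding double quotes, and because the input itself starts with [, the parsed structure comes back wrapped in one extra outer array. A possible post-processing step is sketched below. The helper unwrapItems is hypothetical and not part of the code above, and the $items literal only stands in for the shape that Lexer::run() returns for a shorter input.

```php
<?php
// Hypothetical helper: strips one pair of enclosing double quotes from
// every string leaf; nested arrays are processed recursively.
function unwrapItems(array $items): array {
    $out = array();
    foreach ($items as $item) {
        if (is_array($item)) {
            $out[] = unwrapItems($item);
        } elseif (strlen($item) >= 2 && $item[0] === '"' && substr($item, -1) === '"') {
            $out[] = substr($item, 1, -1); // drop the first and last character
        } else {
            $out[] = $item;                // numbers and empty slots stay as-is
        }
    }
    return $out;
}

// Stand-in for the lexer result of '[["a \"b\"",,1],"ok"]' (assumed shape).
$items = array(array(array('"a \"b\""', '', '1'), '"ok"'));

// $items[0] drops the extra outer array before the quotes are stripped.
$clean = unwrapItems($items[0]);
print_r($clean);
```

Escaped quotes inside the strings are left untouched, which matches the requirement in the question that escaped characters be preserved.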

Upvotes: 0

Robin Mackenzie

Reputation: 19289

This is a tough one for a regex - but there is a hack answer to your question (apologies in advance).

The string is almost a valid PHP array literal, except for the empty elements (the consecutive commas). You can match every comma that is followed by another comma and replace it with ,'' using

/,(?=,)/

Then you can eval that string into the output array you are looking for.

For example:

// input 
$str1 = '[[["Hello, \\"how\\" are you?","Good!",,,123]],,"ok"]';

// replace , followed by , with ,'' with a regex
$pattern = '/,(?=,)/';
$replace = ",''";
$str2 = preg_replace($pattern, $replace, $str1);

// eval updated string
$arr = eval("return $str2;");
var_dump($arr);

I get this:

array(3) {
  [0]=>
  array(1) {
    [0]=>
    array(5) {
      [0]=>
      string(21) "Hello, "how" are you?"
      [1]=>
      string(5) "Good!"
      [2]=>
      string(0) ""
      [3]=>
      string(0) ""
      [4]=>
      int(123)
    }
  }
  [1]=>
  string(0) ""
  [2]=>
  string(2) "ok"
}

Edit

Given the inherent dangers of eval, the better option is to use json_decode with the same replacement approach (JSON needs "" rather than '' for the empty strings), e.g.:

// input 
$str1 = '[[["Hello, \\"how\\" are you?","Good!",,,123]],,"ok"]';

// replace , followed by , with ,"" with a regex
$pattern = '/,(?=,)/';
$replace = ',""';
$str2 = preg_replace($pattern, $replace, $str1);

// decode the updated string as JSON
$arr = json_decode($str2);
var_dump($arr);
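
Since json_decode returns null for input that is not valid JSON, it may be worth wrapping the two steps and checking for errors. The following is a minimal sketch, assuming PHP 7.0 or later; parseNested is a hypothetical name, and the cast to strings is only there because the expected output in the question shows '123' as a string.

```php
<?php
// Hypothetical wrapper around the preg_replace + json_decode idea above.
function parseNested(string $input): array {
    // Turn every comma that is followed by a comma into ,"" (an empty string).
    $json = preg_replace('/,(?=,)/', ',""', $input);
    $data = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($data)) {
        throw new InvalidArgumentException('Not parseable: ' . json_last_error_msg());
    }
    // Optional: cast every leaf to a string, as in the question's expected output.
    array_walk_recursive($data, function (&$value) {
        $value = (string) $value;
    });
    return $data;
}

$arr = parseNested('[[["Hello, \"how\" are you?","Good!",,,123]],,"ok"]');
var_dump($arr[0][0][4]); // string(3) "123"
```

Note that this still has the limits of the regex trick: a ,, inside a quoted string would also be rewritten, and an empty element directly after [ or directly before ] is not filled in, so such input ends up in the exception branch. Also, json_decode unescapes \" to a plain ", as in the var_dump output above.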

Upvotes: 2

Magnavode

Reputation: 53

If you can edit the code that serializes the data then it's a better idea to let the serialization be handled using json_encode & json_decode. No need to reinvent the wheel on this one.
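
As a rough illustration of that round trip (the $data array is an assumed shape for the producing side, with empty slots stored as empty strings rather than left out):

```php
<?php
// Assumed data on the serializing side; empty slots are explicit '' values.
$data = array(array(array('Hello, "how" are you?', 'Good!', '', '', 123)), '', 'ok');

$wire = json_encode($data);       // serialize to a JSON string
$back = json_decode($wire, true); // deserialize back into PHP arrays

var_dump($back === $data); // bool(true)
```

With this format no regex is needed on the reading side, and numbers keep their type.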

Nice cat btw.

Upvotes: 1
