regex to extract nested elements

Question

I have text document which contains following structure

{1,"StructEx",
[{1,"Element_1",[{2,"name","exampleName"},{3,"exampleValue",1},{2,"exampleComment","foo"}]},
[{1,"Element_2",[{2,"name","exampleName2"},{3,"exampleValue",2},{2,"exampleComment","bar"}]}]}

Where first value in parentheses is data type. I need regex expression that will return (through couple of iterations) all elements in StructEx so I can pack it in something like this

["StructEx"]=>
array(2) {
["Element_1"]=>
array(3) {
  ["name"]=>
  string(11) "exampleName"
  ["exampleValue"]=>
  int(1)
  ["exampleComment"]=>
  string(3) "foo"
}
["Element_2"]=>
array(3) {
  ["name"]=>
  string(12) "exampleName2"
  ["exampleValue"]=>
  int(2)
  ["exampleComment"]=>
  string(3) "bar"
}
}

gramonov · Accepted Answer

Assuming nesting is arbitrary deep, then the answer is no, you cannot parse this kind of text document with regex. If the depth is finite, it's painful, but doable.

Why the heck is that?

Alphabets

To answer your question in more details, let's introduce some awesome dry theory.

Let's define an alphabet Σ as an non-empty set of symbols (a more technically correct definition would come from category theory by treating an alphabet as a non-empty free object, but for the sake of the argument this definition will suffice.

In our alphabet Σ, we can define a set of all finite strings (read: words) over the alphabet, that is:

Σ = {s₁, s₂, ... ,s_n}

Σ* = {ε} ∪ {s_i1s_i2...s_im | s_ik ∈ Σ, m > 0, 1 ≤ k ≤ n}, where ε is an empty string

As an example, it means that if we have an alphabet Σ = {a, b, c} and some of the words in Σ* will be aaaaaa, abababa, but not abd, because we don't even know that d exists.

Regular expressions & languages

Given the alphabet Σ, we have regular expressions like ab*|c. I'm skipping formal definition of regex to make it less confusing, so let's assume it's our plain old "practical" regex.

Each regular expression defines defines a regular language, e.g. in this example the language consists of words a, ab, c, abbbbbc, but not abc.

Finite automata

Each regular language can be expressed as a finite automaton, a device that can recognize regular expressions. For the aforementioned regular expression ab*|c, an automaton looks like this:

ab*|c

0 is a start state, double circles are accept states. In short, the automaton starts in state 0 consumes each letter of a word and moves according to the transition arrows. If it ends up in the accept state, we say it accepts the string. Otherwise, we say it rejects it.

So in this case, feeding a string abb into our machine:

Start in state 0
Consume a, move to state 1
Consume b, move to state 3
Consume b, move to state 3
String is empty and state 3 is an accept state, so this machine accepts the string or, equivalently, our regex pattern matches the string.

Let's see what happens when we feed abc into our machine:

Start in state 0
Consume a, move to state 1
Consume b, move to state 3
Consume c, nowhere to move, string is rejected

So our regex does not match abc. All of this is basically the same as practical regex with some added theory.

Equivalence

There is a theorem that states that each regular language is finite automata-recognizable. That means, if there exists a regular language (and an underlying regular expression) that can match your desired pattern, there should be an equivalent finite automaton.

So, why not?

But the nesting in your pattern has infinite depth. Therefore, you will need an infinitely large finite automaton that is equivalent to the regular language, which is contradictory to the definition of a finite automaton.

Reference

Regex & Automata

Disclaimer

As you can see, I skipped inductive definition of regex, finite automaton from category theory perspective, closures under operations, and some other formal stuff. You are welcome to read about it in the aforementioned reference links.