Fibo Kowalsky
Fibo Kowalsky

Reputation: 1248

How do I parse a string representing a recursive data structure using Perl regular expressions?

I wonder what methods are in Perl for traversing a recursive structure (e.g. binary tree) which is given as a string.

More concretely:

Here is a tree, for simplicity is parse tree and very short. imagine it is string without fancy tabbing and spaces.

tree(Sentence, 
  tree(NounPhrase,
    leaf(Determiner, "a"),
    leaf(Noun, "man", "singular")
  ), 
  tree(VerbPhrase,
    leaf(Verb, "walks", "present", "3rd person")
  )
)

Now I want to access two direct child nodes of the root, but I cannot do this with regular expressions simply.

m/tree \( \w+ , (group1) , (group2) \) /x

I would like to capture group1 and group2 correctly, i.e. group1 and group2 having even number of opening and closing parentheses.

It seems quite complicated task and wonder what is the common/simplest solution to it?

For example, prolog will easily digest this task.

Upvotes: 3

Views: 479

Answers (3)

scozy
scozy

Reputation: 2582

If you know how many children to expect, which your example regex suggests, then it is fairly easy, and something like this will be enough:

my @children = m{ tree\(  \w+?, ( (?:tree|leaf)\(.+\) ), ( (?:tree|leaf)\(.+\) ) \) }x;

If you don't, which seems more likely, then it is indeed not simple, but it is possible. In his book on regular expressions, Jeffrey Friedl suggests using what he calls the dynamic regex construct to build a recursive pattern in order to match nested pairs.

# first, strip your string
s{ ^ tree\( \w+ , (.+) \) $ }{$1}x;

# then, define the recursive pattern to match paired parentheses
my $recursion;
$recursion = qr{ (?> [^()]+ | \( (??{ $recursion }) \) )* }x;

# finally, match!
my @children = m{ ( (?: tree | leaf ) \( $recursion \) ) ,?}gx;

In perlre, this is called postponed regular subexpression, and is noted as an experimental feature.

Upvotes: 0

Fibo Kowalsky
Fibo Kowalsky

Reputation: 1248

Ok, thanks, so the answer is "Simply, it is not possible only with RegEx".

Upvotes: 0

Tudor Constantin
Tudor Constantin

Reputation: 26871

I would try by creating 2 functions: sub tree{} and sub leaf{}

each of them would return a tagged term as a string, for example leaf(Determiner, "a") would return <Determiner>a</Determiner>

then simply execute the file you want to process. The output would be a DOM like structure which you can parse with any DOM parser like XML::DOM for example

Upvotes: 2

Related Questions