How do I construct a parse tree from a series of S-expression tokens in Prolog?

Question

I'm curious about Prolog as a parser, so I'm making a little Lisp front-end. I have already made a tokenizer, which you can see here:

base_tokenize([], Buffer, [Buffer]).
base_tokenize([Char | Chars], Buffer, Tokens) :-
    (Char = '(' ; Char = ')') ->
        base_tokenize(Chars, '', Tail_Tokens),
        Tokens = [Buffer, Char | Tail_Tokens];

    Char = ' ' ->
        base_tokenize(Chars, '', Tail_Tokens),
        Tokens = [Buffer | Tail_Tokens];

    atom_concat(Buffer, Char, New_Buffer),
    base_tokenize(Chars, New_Buffer, Tokens).

filter_empty_blank([], []).
filter_empty_blank([Head | Tail], Result) :-
    filter_empty_blank(Tail, Tail_Result),
    ((Head = [] ; Head = '') ->
        Result = Tail_Result;
        Result = [Head | Tail_Result]).

tokenize(Expr, Tokens) :-
    atom_chars(Expr, Chars),
    base_tokenize(Chars, '', Dirty_Tokens),
    filter_empty_blank(Dirty_Tokens, Tokens).

I now have a new challenge: construct a parse tree from this. First, I tried making one without a grammar, but that turned out really messy. So I'm using DCGs. Wikipedia's page on it is not very clear - especially the portion Parsing with DCGs. Maybe someone can give me a clearer idea of how I would construct a tree? I was very happy to know that Prolog's lists are untyped, so it's a bit easier now that no sum types are needed. I'm just really confused about inputs to grammar clauses like sentence(s(NP,VP)) or verb(v(eats)) (on the Wiki), why the arguments have such abstruse names, and how I can get started with my parser without too much hassle.

expr --> [foo].
expr --> list.

seq --> expr, seq.
seq --> expr.

list --> ['('], seq, [')'].

David Tonhofer · Accepted Answer

Here is a beginning: Parsing a LISP list-of-atom, which at first is unstructured list-of-token:

List = [ '(', '(', foo, bar, ')', baz ')' ].

First, just accept it.

Write down the grammar directly:

so_list         --> ['('], so_list_content, [')'].

so_list_content --> [].
so_list_content --> so_atom, so_list_content.
so_list_content --> so_list, so_list_content.

so_atom         --> [X], { \+ member(X,['(',')']),atom(X) }.

Add some test cases (is there plunit in GNU Prolog?)

:- begin_tests(accept_list).

test(1,[fail])        :- phrase(so_list,[]).
test(2,[true,nondet]) :- phrase(so_list,['(',')']).
test(3,[true,nondet]) :- phrase(so_list,['(',foo,')']).
test(4,[true,nondet]) :- phrase(so_list,['(',foo,'(',bar,')',')']).
test(5,[true,nondet]) :- phrase(so_list,['(','(',bar,')',foo,')']).
test(6,[fail])        :- phrase(so_list,['(',foo,'(',bar,')']).

:- end_tests(accept_list).

And so:

?- run_tests.
% PL-Unit: accept_list ...... done
% All 6 tests passed
true.

Cool. Looks like we can accept lists-of-tokens.

Now build a parse tree. This is done by growing a Prolog term through parameters of the "DCG predicates". The term (or multiple terms) in the head collect the terms (or multiple terms) appearing in the body into a larger structure, quite naturally. Once the terminal tokens are reached, the structure starts to fill up with actual content:

so_list(list(Stuff))   --> ['('], so_list_content(Stuff), [')'].

so_list_content([])        --> [].
so_list_content([A|Stuff]) --> so_atom(A), so_list_content(Stuff).
so_list_content([L|Stuff]) --> so_list(L), so_list_content(Stuff).

so_atom(X) --> [X], { \+ member(X,['(',')']),atom(X) }.

Yup, tests (move the expected Result out of the test head because the visual noise is too much)

:- begin_tests(parse_list).

test(1,[fail]) :-
   phrase(so_list(_),[]).

test(2,[true(L==Result),nondet]) :-
   phrase(so_list(L),['(',')']),
   Result = list([]).
   
test(3,[true(L==Result),nondet]) :-
   phrase(so_list(L),['(',foo,')']),
   Result = list([foo]).
   
test(4,[true(L==Result),nondet]) :-
   phrase(so_list(L),['(',foo,'(',bar,')',')']),
   Result = list([foo,list([bar])]).
   
test(5,[true(L==Result),nondet]) :-
   phrase(so_list(L),['(','(',bar,')',foo,')']),
   Result = list([list([bar]),foo]).
   
test(6,[fail]) :-
   phrase(so_list(_),['(',foo,'(',bar,')']).

:- end_tests(parse_list).

And so:

?- run_tests.
% PL-Unit: parse_list ...... done
% All 6 tests passed
true.

How do I construct a parse tree from a series of S-expression tokens in Prolog?

Answers (1)

Related Questions