user360872
user360872

Reputation: 1361

Simple XML parser in bison/flex

I would like to create simple xml parser using bison/flex. I don't need validation, comments, arguments, only <tag>value</tag>, where value can be number, string or other <tag>value</tag>.

So for example:

<div>
  <mul>
    <num>20</num>
    <add>
      <num>1</num>
      <num>5</num>
    </add>
  </mul>
  <id>test</id>
</div>

If it helps, I know the names of all tags that may occur. I know how many sub-tag can be hold by given tag. Is it possible to create bison parser that would do something like that:

- new Tag("num", 1)           // tag1
- new Tag("num", 5)           // tag2
- new Tag("add", tag1, tag2)  // tag3
- new Tag("num", 20)          // tag4
- new Tag("mul", tag4, tag3)
...
- root = top_tag

Tag & number of sub-tags:

Could you help me with grammar to be able to create AST like given above?

Upvotes: 4

Views: 9515

Answers (2)

Dennis
Dennis

Reputation: 51

For your requirements, I think the yax system would work well. From the README:

The goal of the yax project is to allow the use of YACC (Gnu Bison actually) to parse/process XML documents.

The key piece of software for achieving the above goal is to provide a library that can produce an XML lexical token stream from an XML document.

This stream can be wrapped to create an instance of yylex() to feed tokens to a Bison grammar to parse and process the XML document.

Using the stream plus a Bison grammar, it is possible to carry at least the following kinds of activities.

  1. Validate XML documents,
  2. Directly parse XML documents to create internal data structures,
  3. Construct DOM trees.

Upvotes: 5

VGE
VGE

Reputation: 4191

I do not think that it's the best tool to use to create a xml parser. If I have to do this job, I'll do it by hand.

Flex code will contains : NUM match integer in this example. STR match match any string which does not contains a '<' or '>'. STOP match all closing tags. START match starting tags.

<\?.*\?> { ;} 
<[a-z]+> { return START; }
</[a-z]+> { return STOP; }
[0-9]+ { return NUM; }
[^><]+ { return STR; }

Bison code will look like

%token START, STOP, STR, NUM
%%
simple_xml : START value STOP
;
value : simple_xml 
| STR
| NUM
| value simple_xml
;

Upvotes: 2

Related Questions