FPGA
FPGA

Reputation: 3865

Implementing an xml parser in c

i am looking at building a simple xml parser with c99, i want to implement every single detail to it just for learning purposes, from my understanding the best way is implementing a tree structure and tokenizing the xml string into a tree structure so it will look something likeenter image description here

and i will have 2 simple structs one that represents a node and one that represents an attribute, how bad is the above design?

any suggestions for improvement?

Upvotes: 4

Views: 1690

Answers (2)

Michael Kay
Michael Kay

Reputation: 163342

In looking at the design of the tree, it's worth writing down and prioritising your objectives, because they help you make decisions about trade-offs. I think there are probably three key metrics: time for building the tree, time for navigating the tree (typically in top-down recursive descent), and space occupancy. Plus development effort, of course. The other important factor is whether you want the tree to be mutable (e.g. as required by DOM) or immutable (e.g. for XPath/XSLT/XQuery).

Other factors specific to XML: how much information do you want to retain in the tree? E.g. do you want to retain CDATA section boundaries? And entity references? Or do you want to expand these inline?

I don't know what c99 is, that may impose additional constraints or provide opportunities.

Upvotes: 1

M Oehm
M Oehm

Reputation: 29126

The complexity of your chosen task aside, your data structure looks good at first sight, but in my opinion there are two or three things wrong:

  • You'll have to account not only for child nodes, but also for sibling nodes that share the same parent
  • There's no need to make the sttribute tree a binary tree. For simplicity, I'd just use a singly-linked list.
  • You need to account for the contents of the nodes between the opening and closing brackets (unless your node structure already accounts fot it.)

So you really need a binary tree for the xml structure itself and a linked list of attributes for each node. For example, consider this simple xml-style data:

<dinner time="19:00" dresscode="informal">
    <course id="starter">
        <food>Consomme</food>
        <food>Tomato soup</food>
    <course>
    <course id="salad" optional=optional>
        <food>Green salad</food>
    <course>
    <course id="main">
        <food>Steak and kidney pie</food>
        <food type=vegetarian>Spinach lasagna</food>
    <course>
    <course id="dessert">
        <food>Fruit</food>
        <food>Ice cream</food>
        <food>Coffee</food>
    <course>
</dinner>

The food items are the children of the courses, but are siblings of each other if they have the same course as parent. The tree structure looks like the indentation: Items on the same level are siblings, indented items are children.

You need only keep a pointer to the oldest child, other children are reachable via the sibling relationship, which is also a pointer. (In binary-tree nomenclature, children are the left links and siblings are the right links.) For easy traversing you should also keep a pointer to the parent.

The textual content and the attributes are just data attached to the nodes.

(Of course, looking at the source of existing XML parsers might give you better ideas.)

Upvotes: 2

Related Questions