Reputation: 100

Parse an XML in standard C/C++ without additional libraries

I have an XML (assuming it is valid) and I must parse it and store it in a tree.

What is the best approach to parse it, without using other libraries, just basic manipulation of strings?

Keep in mind that I don't have to validate it, just parse and memorize it into a tree.

Upvotes: 2

Answers (3)

Dietmar Kühl

Reputation: 153830

Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:

whether you validate or not, you need to deal with entity references like < and the variety of character entity references like A and 

the plain body of an XML document is relatively simple but the header a major pain to deal with in particular the DTD: there are two versions thereof which are slightly different and you probably need to process the inline DTD
even the body isn't entirely trivial because of these annoying character data segments
even without validation you may need to support external entity references
the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
note that XML is defined in terms of Unicode and proper handling of this isn't entirely trivial either: just using char or wchar_t just doesn't cut it.

The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator user. Unfortunately, I didn't get it to fly when trying to copy with the various entity references. It would parse simple XML files nice and fast but some quirks in the specification I just didn't get right.

What worked best for me was creating a simple recursive decent parser combined with a suitable stack of buffers to somewhat transparently deal with entity references. However, to finish this completely I still need to deal with some encoding issues and in the end I just had higher priority projects to work on (in my spare time, that is).

In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.

Upvotes: 6

celtschk

Reputation: 19721

The basic structure of XML is quite simple:

<tagname [attribute[="value"] ...]>content</tagname>

where the content may contain both normal text and more XML structures, or the special form

<tagname [attribute[="value"] ...]/>

which is equivalent to

<tagname [attribute[="value"] ...]></tagname>

that is,. empty content.

So if you don't need to interpret a DTD or do other fancy things, you can do the following:

Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
Now follows the tag name, until the first whitespace, or the / or the > character. Store that.
If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML, and can exit with an error.
If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
Otherwise what follows is an argument. Parse that, store the result, and continue at step 3.
Read the content until you find a < character.
If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.
If you get here, you've found the beginning of a nested XML. Parse that with this algorithm, and then continue at 6.

Upvotes: 7

Yakov Galka

Reputation: 72479

The best and only approach is to re-implement such a library from scratch without using any other libraries...

You're welcome to use existing libraries like pugixml, for example. It's installation is as simple as adding the files to your project and start using it. It's lightweight compared to other validating parsers, such as Xerces.

Upvotes: 3

Parse an XML in standard C/C++ without additional libraries

Answers (3)

Related Questions