Strigger

Reputation: 1903

XML Parsing: Checking for strings within string C++

I have written a simple C++ shell program to parse large XML files and fix syntax errors.

So far I have covered everything I can think of except strings within strings, for example:

<ROOT>
  <NODE attribute="This is a "string within" a string" />
</ROOT>

My program loops through the entire XML file character by character (keeping only a few characters in memory at a time for efficiency). It looks for characters such as &, < and > and escapes them as &amp;, &lt; and &gt;. A basic example of what I am doing can be found in the accepted answer to "Escaping characters in large XML files".

The question is: what conditions or logic can I use to detect "string within" so that I can escape the inner quotes like this:

<ROOT>
  <NODE attribute="This is a &quot;string within&quot; a string" />
</ROOT>

Is it even possible at all?

Upvotes: 1

Views: 474

Answers (2)

Sebastian

Reputation: 2974

I think it's difficult to decide where one attribute ends and another begins. You need to restrict the input you are willing to parse; otherwise you will have ambiguous cases such as this one:

<ROOT>
  <NODE attribute="This is a "string within" a string" attribute2="This is another "string within" a string" />
</ROOT>

This could be read as either two attributes or one.

One assumption you could make is that a new attribute begins at an equal sign that follows an even number of double quotes. Then you simply replace all the inner double quotes with your escape string. Alternatively, treat any equal sign after two or more double quotes as the start of a new attribute. The same assumption could be applied to the end of the node.
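A rough sketch of one way to code a heuristic along these lines, offered as an illustration rather than a robust fix. It assumes the whole tag is available as a single string (so it needs more lookahead than a few characters) and that only double quotes delimit values. The function names escapeInnerQuotes and looksLikeValueEnd are made up for this example. A quote inside a value is treated as the closing quote only if it is followed by the end of the tag, or by whitespace, a name, an equal sign and another quote. Every other quote is escaped.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: characters allowed in an attribute name (simplified).
static bool isNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) ||
           c == '_' || c == '-' || c == '.' || c == ':';
}

static bool isSpace(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Guess whether the quote just before position i closes an attribute value.
// It does if what follows is the end of the tag, or whitespace + name + '=' + '"'.
static bool looksLikeValueEnd(const std::string& s, std::size_t i) {
    std::size_t j = i;
    while (j < s.size() && isSpace(s[j])) ++j;
    if (j == s.size() || s[j] == '>') return true;
    if ((s[j] == '/' || s[j] == '?') && j + 1 < s.size() && s[j + 1] == '>') return true;
    if (j == i) return false;              // a following attribute needs leading whitespace
    std::size_t nameStart = j;
    while (j < s.size() && isNameChar(s[j])) ++j;
    if (j == nameStart) return false;      // no attribute name found
    while (j < s.size() && isSpace(s[j])) ++j;
    if (j == s.size() || s[j] != '=') return false;
    ++j;
    while (j < s.size() && isSpace(s[j])) ++j;
    return j < s.size() && s[j] == '"';
}

// Escape quotes that appear to sit inside an attribute value of a single tag.
std::string escapeInnerQuotes(const std::string& tag) {
    std::string out;
    bool inValue = false;
    for (std::size_t i = 0; i < tag.size(); ++i) {
        const char c = tag[i];
        if (c != '"') {
            out += c;
        } else if (!inValue) {
            inValue = true;                // opening quote of a value
            out += c;
        } else if (looksLikeValueEnd(tag, i + 1)) {
            inValue = false;               // closing quote of a value
            out += c;
        } else {
            out += "&quot;";               // guessed to be an inner quote
        }
    }
    return out;
}
```

Calling escapeInnerQuotes on the tag from the question yields the desired &quot; form, and the two-attribute example above comes out as two attributes. It is still a guess: a value that legitimately contains something like `" name="` will be split at that point.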

Upvotes: 1

Eclipse

Reputation: 45533

A better solution would be to prevent these kinds of errors before they are created. XML is designed to be strict precisely to avoid having to make these kinds of guesses. If the XML is invalid, the only thing you should do is reject it and output a helpful error message.

Who's to say that your correction:

<NODE attribute="This is a &quot;string within&quot; a string" />

is better than

<NODE attribute="This is a " string-within=" a string" />

Obviously, with the benefit of understanding English, we can be pretty certain that it's the former, but when you're taking an automated approach to it, there's no way to be certain that you're not covering up a more serious error.

The place to fix escaping issues is when you're creating the XML file.
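As a minimal sketch of what that could look like on the writing side, assuming the producer is also C++ and builds attribute values from raw strings: escape each value before it is placed between quotes. The name escapeXmlAttribute is hypothetical. The five replacements are the predefined XML entities.

```cpp
#include <string>

// Escape a raw string so it can be written safely inside a quoted XML attribute.
std::string escapeXmlAttribute(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

The writer would then emit something like `"<NODE attribute=\"" + escapeXmlAttribute(text) + "\" />"`, so the ambiguity never reaches the file.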

Upvotes: 4
