ROLO18

Reputation: 11

Replacing/Removing XML file lines

I'm pretty new at writing scripts, and I'm having a hard time figuring out how to write a simple VBA macro that edits and saves multiple XML files by either removing or replacing certain text.

For instance, I need to delete the following from multiple XML files: <!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES... etc

The other thing is that I need to replace "&deg" with "°deg" within the same XML files.

Thank you for any help.

Upvotes: 1

Views: 478

Answers (1)

G_H

Reputation: 12019

What's important to note about XML documents is that they're structured data. Trying to alter the XML simply by doing text replacements could lead to malformed XML. Take for example the following XML element:

<test>1 is lower than 2 and 3 is greater than 2</test>

A naive replacement of "is lower than" with "<" and "is greater than" with ">" would yield the following:

<test>1 < 2 and 3 > 2</test>

Now your XML syntax is ruined because the characters < and > are reserved for markup. An XML parser can no longer tell whether they're intended as plain text or as part of a tag. So in order to use them as plain text, they ought to be represented by entity references:

<test>1 &lt; 2 and 3 &gt; 2</test>

An XML parser that reads the above would do the necessary substitution, and if you asked it for the text content of the <test> element, it would respond with 1 < 2 and 3 > 2.

What I'm getting at is that the proper way to manipulate your XML input is to use a library that "understands" XML markup. For VBA you could use MSXML. Info can be found here: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms763742(v=vs.85)?redirectedfrom=MSDN

Some options MSXML gives you for deleting certain nodes or doing text replacements:

  • Use SAX2, the Simple API for XML parsing. SAX uses an event-based approach where parsing the XML generates "events" such as the start of an element, text data, a processing instruction, etc. You could check the type of event, make the necessary adjustments (for example, if it's a text event, do the necessary replacements; for certain processing instructions, just leave them out entirely) and then supply the modified event to a writer that outputs the updated XML file. This is fairly low-level but can operate very quickly and without much memory overhead.
  • Use DOM. The "Document Object Model" is an in-memory representation of the XML document. You can traverse the tree of XML nodes and manipulate them as needed: removing, adding, replacing... It's often more intuitive than SAX, but the disadvantage is that the entire file is read into memory, with a lot of overhead for the model. This may not work well for larger documents.
  • Use XSLT. This is a transformation language for XML input. XSLT files are themselves written in XML. They describe how XML input is transformed and can create output as XML, HTML or plain text. It would require some additional studying to understand XSLT itself, but there are lots of resources out there, including plenty of relevant questions here on StackOverflow. Both filtering out certain nodes and doing text replacements are straightforward enough with XSLT, and usually it's very fast and memory-efficient. This would be the most high-level approach.

The main point is that all the above will make sure you're outputting well-formed XML and these technologies are intended for reading and manipulating XML documents. I suggest you check out MSXML and do some playing around with these APIs and a simple input document to get a feel for what's possible and what you prefer.
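If you want something small to experiment with first, the DOM option might look roughly like this. Again it's a Python standard-library sketch standing in for MSXML; the element names (note, obsolete) and the function edit are made up for the example. It loads a tiny document, removes one kind of element, edits the text nodes and serializes the result.

```python
from xml.dom import minidom

SAMPLE = "<root><note>1 is lower than 2</note><obsolete>remove me</obsolete></root>"


def edit(xml_text):
    doc = minidom.parseString(xml_text)

    # Remove every <obsolete> element from the tree.
    for node in list(doc.getElementsByTagName("obsolete")):
        node.parentNode.removeChild(node)

    # Walk the tree and apply the replacement to text nodes only.
    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                child.data = child.data.replace("is lower than", "<")
            else:
                walk(child)

    walk(doc)
    # Serializing escapes the "<" in the text as "&lt;".
    return doc.documentElement.toxml()


edited = edit(SAMPLE)
print(edited)
```

The same shape (load, select nodes, modify, save) is what you'd be doing with a DOM document in MSXML, just with different object and method names.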

Upvotes: 1
