Eran Medan

Reputation: 45705

Replace text in an XML file using XPath while preserving formatting

I would like to replace text in an XML file, but preserve any other formatting in the source file.

E.g. parsing it as a DOM, replacing the node using XPath, and writing the result out as a String might not do the trick, because that reformats the entire file. (Pretty printing might be fine for 99% of cases, but the requirement is to preserve the existing formatting, even if it's not "pretty".)

Is there any Java / Scala library that can do a "find and replace" on a String without parsing it into a DOM tree, or that can at least preserve the original formatting?

EDIT:

I think the Maven replacer plugin does something like this; it seems to preserve the original whitespace formatting by using setPreserveSpace (I think; I still need to try it):

    import java.io.StringWriter;
    import org.apache.xml.serialize.OutputFormat;
    import org.apache.xml.serialize.XMLSerializer;
    import org.w3c.dom.Document;
    ...
    // Serialize the DOM back to a String, asking the Xerces serializer to preserve whitespace.
    private String writeXml(Document doc) throws Exception {
        OutputFormat of = new OutputFormat(doc);
        of.setPreserveSpace(true);
        of.setEncoding(doc.getXmlEncoding());

        StringWriter sw = new StringWriter();
        XMLSerializer serializer = new XMLSerializer(sw, of);
        serializer.serialize(doc);
        return sw.toString();
    }

So the question changes to: is there a straightforward way to do this without extra dependencies?

EDIT2:

The requirement is to use an XPath query provided externally, i.e. as a String.
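To make the requirement concrete, here is a minimal sketch of the JDK-only baseline: parse to a DOM, evaluate an XPath supplied as a String, replace the text, and serialize with an identity Transformer with indenting turned off. The class name XPathTextReplace and the method replace are made up for illustration. Whitespace-only text nodes survive this round trip, but other details (attribute quoting, the XML declaration, entity forms) may be normalized by the serializer, so this is only a partial answer to "preserve formatting".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative, not from any library.
class XPathTextReplace {

    // Replaces the text content of every node matched by the XPath expression.
    static String replace(String xml, String xpath, String newText) throws Exception {
        // The default DocumentBuilder keeps whitespace-only text nodes.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // The XPath arrives as a plain String, as the question requires.
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            hits.item(i).setTextContent(newText);
        }

        // Identity transform: no stylesheet, no indenting, no XML declaration.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "no");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>  <foo a=\"1\"> <bar> red </bar>   </foo></doc>";
        System.out.println(replace(xml, "//bar", "blue"));
    }
}
```

Note that setTextContent replaces the whole text of the matched element, including any surrounding spaces inside it; an XPath such as //bar/text() combined with a substring replacement would be one way to keep those.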

Upvotes: 4

Views: 824

Answers (2)

som-snytt

Reputation: 39577

I was going to code up something quick to recall scala.xml and how much I dislike it; I haven't used it since I first learned some Scala.

You normally see text nodes of whitespace; this is mentioned in PiS (Programming in Scala), in the "catalog" example.

I did remember that it reverses attributes on load; I vaguely remembered having to fix pretty printing.

But the compiler doesn't reverse attributes on XML literals. So, given that you want to supply an XPath dynamically, you could use the compiler toolbox to compile the source document as a literal and also compile the XPath string, with the / operators converted to \.

That's just a little out-of-the-box fun, but maybe it has a sweet spot of applicability, perhaps if you must use only the standard Scala distro.

I'll update later when I get a chance to try it out.

import scala.xml._
import java.io.File

object Test extends App {
  val src =
    """|<doc>
       |  <foo bar="red" baz="yellow"> <bar> red </bar> </foo>
       |  <baz><bar>red</bar></baz>
       |</doc>""".stripMargin

  val red = "(.*)red(.*)".r
  val sub = "blue"

  val tmp =
<doc>
   <foo bar="red" baz="yellow"> <bar> red </bar> </foo>
   <baz><bar>red</bar></baz>
</doc>

  Console println tmp

  // replace "red" with "blue" in all bar text

  val root = XML loadString src
  Console println root
  val bars = root \\ "bar"
  val barbars =
    bars map (_ match {
      case <bar>{Text(red(prefix, suffix))}</bar> =>
           <bar>{Text(s"$prefix$sub$suffix")}</bar>
      case b => b
    })
  val m = (bars zip barbars).toMap
  val sb = serialize(root, m)
  Console println sb

  def serialize(x: Node, m: Map[Node, Node], sb: StringBuilder = new StringBuilder) = {
    def serialize0(x: Node): Unit = x match {
      case e0: Elem =>
        val e = if (m contains e0) m(e0) else e0
        sb append "<"
        e nameToString sb
        if (e.attributes ne null) e.attributes buildString sb
        if (e.child.isEmpty) sb append "/>"
        else {
          sb append ">"
          for (c <- e.child) serialize0(c)
          sb append "</"
          e nameToString sb
          sb append ">"
        }
      case Text(t) => sb append t
    }
    serialize0(x)
    sb
  }
}

Upvotes: 2

stefan.schwetschke

Reputation: 8932

You can try scala.xml.pull or Scales XML.

You can find working code for parsing files here.

Scales XML can use the StAX API, which is a streaming API, so a full DOM is never built and most of the XML is usually passed through without much pre-processing.

Test it with your specially formatted XML file and see whether it works out.

I would not recommend using simple text search and replace on XML. There is a good chance of a mismatch, in which case you alter the document in an unpredictable way. The resulting bugs are usually hard to find.
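As a small illustration of such a mismatch, the following sketch runs a plain String replacement over a document. The class name NaiveReplaceDemo and the sample XML are made up for this example. The intent is to change the text of bar from "red" to "blue", but the replacement also hits an attribute value and an unrelated element name that happens to contain "red".

```java
// Hypothetical demo: shows why a blind textual replace is risky on XML.
class NaiveReplaceDemo {

    // Replaces every literal occurrence, with no knowledge of XML structure.
    static String naive(String xml, String from, String to) {
        return xml.replace(from, to);
    }

    public static void main(String[] args) {
        String xml = "<doc><foo bar=\"red\"><bar>red</bar></foo><credit>5</credit></doc>";
        // prints <doc><foo bar="blue"><bar>blue</bar></foo><cblueit>5</cblueit></doc>
        System.out.println(naive(xml, "red", "blue"));
    }
}
```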

I ran a short experiment with Scales XML, and it looks quite promising:

    scala> import scales.utils._
    import scales.utils._
    scala> import ScalesUtils._
    import ScalesUtils._
    scala> import scales.xml._
    import scales.xml._
    scala> import ScalesXml._
    import ScalesXml._
    scala> import scales.xml.serializers.StreamSerializer
    import scales.xml.serializers.StreamSerializer
    scala> import java.io.StringReader
    import java.io.StringReader
    scala> import java.io.PrintWriter
    import java.io.PrintWriter

    scala> def xmlsrc=new StringReader("""
         | <a attr1="value1"> <b/>This
         | is some tex<xt/>
         |   <!-- A comment -->
         |   <c><d>
         |   </d>
         |   <removeme/>
         |   <changeme/>
         | </c>
         | </a>
         | """)
    xmlsrc: java.io.StringReader

    scala> def pull=pullXml(xmlsrc)
    pull: scales.xml.XmlPull with java.io.Closeable with scales.utils.IsClosed

    scala> writeTo(pull, new PrintWriter(System.out))
    <?xml version="1.0" encoding="UTF-8"?><a attr1="value1"> <b/>This
    is some tex<xt/>
      <!-- A comment -->
      <c><d>
      </d>
      <removeme/>
      <changeme/>
    </c>
    res0: Option[Throwable] = None

    scala> def filtered=pull flatMap {
         |   case Left(e : Elem) if e.name.local == "removeme" => Nil
         |   case Right(e : EndElem) if e.name.local == "removeme" => Nil
         |   case Left(e : Elem) if e.name.local == "changeme" => List(Left(Elem("x")), Left(Elem("y")), Right(EndElem("x")))
         |   case Right(e : EndElem) if e.name.local == "changeme" => List(Right(EndElem("x")))
         |   case otherwise => List(otherwise)
         | }
    filtered: Iterator[scales.xml.PullType]

    scala> writeTo(filtered, new PrintWriter(System.out))
    <?xml version="1.0" encoding="UTF-8"?><a attr1="value1"> <b/>This
    is some tex<xt/>
      <!-- A comment -->
      <c><d>
      </d>

      <x><y/></x>
    </c>
    res1: Option[Throwable] = None

The example first initializes the XML token stream. Then it prints the token stream unmodified; you can see that comments and formatting are preserved. Then it modifies the token stream with the monadic Scala API and prints the result. You can see that most formatting is preserved and that only the formatting of the changed parts differs.

So it looks like Scales XML solves your problem in a straightforward way.
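For comparison, a rough JDK-only analog of this token-stream idea can be sketched with the StAX event API in javax.xml.stream. This is not Scales XML, and it does not take an XPath; it simply replaces the text directly inside elements with a given local name. The class name StaxReplace and the method replaceText are made up for illustration, and serialization details (for instance whether an empty element comes out as <b/> or <b></b>) depend on the StAX implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies the event stream, swapping selected text events.
class StaxReplace {

    static String replaceText(String xml, String elementName, String replacement)
            throws Exception {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Ask the parser to deliver adjacent text as a single Characters event.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));

        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        // True only between the target's start tag and the next start or end tag.
        boolean inTarget = false;
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartDocument() || e.isEndDocument()) {
                continue; // skip these so no XML declaration is synthesized
            }
            if (e.isStartElement()) {
                inTarget = e.asStartElement().getName().getLocalPart().equals(elementName);
            } else if (e.isEndElement()) {
                inTarget = false;
            } else if (e.isCharacters() && inTarget) {
                e = events.createCharacters(replacement);
            }
            writer.add(e); // whitespace and comments pass through as events
        }
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a> <b>red</b>  <!-- c --> <c>red</c></a>";
        System.out.println(replaceText(xml, "b", "blue"));
    }
}
```

As with the Scales XML experiment, whitespace between elements and comments travel through the stream as ordinary events, so they are written back as they were read; only the replaced text event differs.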

Upvotes: 1
