Sean Allred

Reputation: 3658

Pretty-print XML (with attribute alignment)

This is a follow-up question to "How to pretty print XML from the command line?".

Is there any tool in libxml2 that will allow me to align the attributes of each node as well? I have a large XML document whose logical structure I cannot change, but I would like to turn

<a attr="one" bttr="two" tttr="three" fttr="four"/>

into

<a attr   = "one"
   bttr   = "two"
   tttr   = "three"
   fttr   = "four"
   longer = "attribute" />

Upvotes: 4

Views: 1989

Answers (2)

StackzOfZtuff

Reputation: 3106

Try xml_pp with style "-s cvs"

You asked for something in libxml2. I don't know about that. But if you are willing to use something else, then read on below.

xml_pp is part of the XML::Twig library and has a bunch of different preconfigured styles.

You can specify a style via the "-s" (style) parameter.

If you pass "-s" without a value, xml_pp prints a usage message that lists all available styles. (It actually generates that list on the fly, so it's guaranteed to be fresh.)

$ xml_pp -s
Use of uninitialized value $opt{"style"} in hash element at /usr/bin/xml_pp line 100.
usage: /usr/bin/xml_pp [-v] [-i<extension>] [-s (none|nsgmls|nice|indented|indented_close_tag|indented_c|wrapped|record_c|record|cvs|indented_a)] [-p <tag(s)>] [-e <encoding>] [-l] [-f <file>] [<files>] at /usr/bin/xml_pp line 100.

Here's the same thing again but in a nicer list format. It turns out that the version I have installed supports 11 styles out of the box:

$ xml_pp -s 2>&1 | grep -Po '(?<=\[-s \()[^)]*' -o | tr '|' '\n' | nl
     1  none
     2  nsgmls
     3  nice
     4  indented
     5  indented_close_tag
     6  indented_c
     7  wrapped
     8  record_c
     9  record
    10  cvs
    11  indented_a

So let's try them all.

This is our input file:

$ cat in.xml
<a attr="one" bttr="two" tttr="three" fttr="four"/>

And these are all the styles:

$ for STYLE in $(echo "none nsgmls nice indented indented_close_tag indented_c wrapped record_c record cvs indented_a"); do echo; echo "==> Style: xml_pp -s $STYLE <=="; cat in.xml | xml_pp -s $STYLE | tee out.xml_pp.$STYLE.xml; echo; done

==> Style: xml_pp -s none <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s nsgmls <==
<a
attr="one"
bttr="two"
fttr="four"
tttr="three"
/>

==> Style: xml_pp -s nice <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s indented <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s indented_close_tag <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s indented_c <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s wrapped <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s record_c <==

<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s record <==

<a attr="one" bttr="two" fttr="four" tttr="three"/>

==> Style: xml_pp -s cvs <==
<a
    attr="one"
    bttr="two"
    fttr="four"
    tttr="three"
/>

==> Style: xml_pp -s indented_a <==
<a
    attr="one"
    bttr="two"
    fttr="four"
    tttr="three"
/>

A bunch of these styles are equivalent for this small input file. They produce the same output:

$ sha256sum * | sort
452f5c19177d9cc6a54589168dbb1ee790c783a963110662e7dfae170bf997e4  out.xml_pp.cvs.xml
452f5c19177d9cc6a54589168dbb1ee790c783a963110662e7dfae170bf997e4  out.xml_pp.indented_a.xml
8e119bb50bcbf3d72159c96139cf328f46a0de259410acdd344f26e52f033996  out.xml_pp.nsgmls.xml
d1ed9a4d1ebf8b9f1d012577809909e91e1ba0fc01b5afc8ff1302ca9dced617  out.xml_pp.record_c.xml
d1ed9a4d1ebf8b9f1d012577809909e91e1ba0fc01b5afc8ff1302ca9dced617  out.xml_pp.record.xml
e0d13f80ddc48876678c62e407abd3ab1eac8481a82d5aabb1514e24aee4717c  in.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.indented_close_tag.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.indented_c.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.indented.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.nice.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.none.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45  out.xml_pp.wrapped.xml

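None of these styles is exactly what you wanted.

But "cvs" is pretty close. (And "indented_a" produces identical output.) Note that in every style shown above, xml_pp also emitted the attributes in alphabetical order, so fttr ends up before tttr.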
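To get from the one-attribute-per-line layout to aligned "=" signs, a small awk post-processor is one option. This is only a minimal sketch: the function name align_attrs is made up for illustration, it assumes every attribute sits on its own indented line in the form name="value" (as in the "cvs" output above), and it does not parse XML, so an indented text line that happens to look like name="value" would be rewritten too. It also leaves the first attribute on its own line instead of moving it up next to the tag name.

```shell
#!/bin/sh
# Hypothetical helper: align the "=" of consecutive attribute lines.
# Reads one-attribute-per-line XML on stdin, writes the aligned form to stdout.
align_attrs() {
    awk '
        # Print the buffered attribute lines, padding names to a common width.
        function emit(   i, fmt) {
            fmt = "%s%-" width "s = %s\n"
            for (i = 1; i <= n; i++)
                printf fmt, indent[i], name[i], value[i]
            n = 0; width = 0
        }
        # An attribute line: leading whitespace, a name, then ="
        /^[ \t]+[^ \t=<>]+="/ {
            eq = index($0, "=")
            match($0, /^[ \t]+/)
            n++
            indent[n] = substr($0, 1, RLENGTH)
            name[n]   = substr($0, RLENGTH + 1, eq - RLENGTH - 1)
            value[n]  = substr($0, eq + 1)
            if (length(name[n]) > width) width = length(name[n])
            next
        }
        # Any other line ends the current run of attributes.
        { emit(); print }
        END { emit() }
    '
}

# Demo on a literal sample, so it runs without xml_pp installed.
align_attrs <<'EOF'
<a
    attr="one"
    bttr="two"
    fttr="four"
    longer="attribute"
/>
EOF
```

For the sample above, this prints the four attributes with their names padded to the width of "longer", each followed by " = " and the quoted value. In practice the input would come from a pipe instead of the here-document.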

Afterthoughts: bit dirty

The output feels a little dirty in two ways.

(a) Some of the files just start with a blank line for no good reason...

$ grep '^$' * -n
out.xml_pp.record_c.xml:1:
out.xml_pp.record.xml:1:

(b) ... and some of the files just have no line terminators at all:

$ file *
in.xml:                            ASCII text
out.xml_pp.cvs.xml:                ASCII text
out.xml_pp.indented_a.xml:         ASCII text
out.xml_pp.indented_close_tag.xml: ASCII text, with no line terminators
out.xml_pp.indented_c.xml:         ASCII text, with no line terminators
out.xml_pp.indented.xml:           ASCII text, with no line terminators
out.xml_pp.nice.xml:               ASCII text, with no line terminators
out.xml_pp.none.xml:               ASCII text, with no line terminators
out.xml_pp.nsgmls.xml:             ASCII text
out.xml_pp.record_c.xml:           ASCII text
out.xml_pp.record.xml:             ASCII text
out.xml_pp.wrapped.xml:            ASCII text, with no line terminators

The cause seems to be that xml_pp does not write a newline after the last line of its output. Multi-line output therefore still contains newlines between the lines, but single-line output contains no newline byte at all. Quite weird.

Looks like this:

$ wc --lines *
  5 out.xml_pp.cvs.xml
  5 out.xml_pp.indented_a.xml
  0 out.xml_pp.indented_close_tag.xml
  0 out.xml_pp.indented_c.xml
  0 out.xml_pp.indented.xml
  0 out.xml_pp.nice.xml
  0 out.xml_pp.none.xml
  5 out.xml_pp.nsgmls.xml
  1 out.xml_pp.record_c.xml
  1 out.xml_pp.record.xml
  0 out.xml_pp.wrapped.xml
 17 total

This is how I like to add a trailing LF (0x0A byte) where none is present:

$ mkdir 1; mv out.*.xml 1/; cp -r 1/ 2/

$ pcregrep -LMr '\n\Z' 2/ | xargs -n1 --no-run-if-empty -- sed -i -e '$a\' --

$ diff --recursive 1/ 2/ | head
diff --recursive 1/out.xml_pp.cvs.xml 2/out.xml_pp.cvs.xml
6c6
< />
\ No newline at end of file
---
> />
diff --recursive 1/out.xml_pp.indented_a.xml 2/out.xml_pp.indented_a.xml
6c6
< />
\ No newline at end of file

Looks like this afterwards:

$ cd 2/

$ wc --lines *
  6 out.xml_pp.cvs.xml
  6 out.xml_pp.indented_a.xml
  1 out.xml_pp.indented_close_tag.xml
  1 out.xml_pp.indented_c.xml
  1 out.xml_pp.indented.xml
  1 out.xml_pp.nice.xml
  1 out.xml_pp.none.xml
  6 out.xml_pp.nsgmls.xml
  2 out.xml_pp.record_c.xml
  2 out.xml_pp.record.xml
  1 out.xml_pp.wrapped.xml
 28 total
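If pcregrep is not installed, the same fix can be sketched with standard tools only. The function name ensure_trailing_newline is made up for illustration. The idea is that command substitution strips a trailing newline, so a non-empty result from reading the last byte means that byte is not an LF.

```shell
#!/bin/sh
# Hypothetical helper: append an LF to each given file whose last byte is not LF.
# Empty files are left untouched.
ensure_trailing_newline() {
    for f in "$@"; do
        # tail -c 1 prints the last byte; $(...) drops it if it is a newline.
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$f"
        fi
    done
}
```

It could be called as ensure_trailing_newline out.*.xml. Running it a second time changes nothing, because every file then already ends in LF.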

Upvotes: 2

hek2mgl

Reputation: 157927

xmllint, which ships with libxml2, has a --pretty option that supports three levels of prettiness. If this output:

<?xml version="1.0"?>
<a
    attr="one"
    bttr="two"
    tttr="three"
    fttr="four"
/>

is OK for you, then use --pretty 2:

xmllint --pretty 2 - <<< '<a attr="one" bttr="two" tttr="three" fttr="four"/>'

Upvotes: 3
