Reputation: 3658
This is a follow-up question to "How to pretty print XML from the command line?".
Is there any tool in libxml2 that will allow me to align the attributes of each node as well? I have a large XML document whose logical structure I cannot change, but I would like to turn
<a attr="one" bttr="two" tttr="three" fttr="four"/>
into
<a attr = "one"
bttr = "two"
tttr = "three"
fttr = "four"
longer = "attribute" />
Upvotes: 4
Views: 1989
Reputation: 3106
xml_pp with style "-s cvs"
You asked for something in libxml2. I don't know about that. But if you are willing to use something else, then read on.
xml_pp is part of the XML::Twig library and has a bunch of different preconfigured styles.
You can specify a style via the "-s" (style) parameter.
If you just leave "-s" empty, then it will show all available styles. (It actually generates that list on the fly, so it's guaranteed to be fresh.)
$ xml_pp -s
Use of uninitialized value $opt{"style"} in hash element at /usr/bin/xml_pp line 100.
usage: /usr/bin/xml_pp [-v] [-i<extension>] [-s (none|nsgmls|nice|indented|indented_close_tag|indented_c|wrapped|record_c|record|cvs|indented_a)] [-p <tag(s)>] [-e <encoding>] [-l] [-f <file>] [<files>] at /usr/bin/xml_pp line 100.
Here's the same thing again but in a nicer list format. It turns out that the version I have installed supports 11 formats out of the box:
$ xml_pp -s 2>&1 | grep -Po '(?<=\[-s \()[^)]*' -o | tr '|' '\n' | nl
1 none
2 nsgmls
3 nice
4 indented
5 indented_close_tag
6 indented_c
7 wrapped
8 record_c
9 record
10 cvs
11 indented_a
So let's try them all.
This is our input file:
$ cat in.xml
<a attr="one" bttr="two" tttr="three" fttr="four"/>
And these are all the styles:
$ for STYLE in none nsgmls nice indented indented_close_tag indented_c wrapped record_c record cvs indented_a; do echo; echo "==> Style: xml_pp -s $STYLE <=="; xml_pp -s "$STYLE" < in.xml | tee "out.xml_pp.$STYLE.xml"; echo; done
==> Style: xml_pp -s none <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s nsgmls <==
<a
attr="one"
bttr="two"
fttr="four"
tttr="three"
/>
==> Style: xml_pp -s nice <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s indented <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s indented_close_tag <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s indented_c <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s wrapped <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s record_c <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s record <==
<a attr="one" bttr="two" fttr="four" tttr="three"/>
==> Style: xml_pp -s cvs <==
<a
attr="one"
bttr="two"
fttr="four"
tttr="three"
/>
==> Style: xml_pp -s indented_a <==
<a
attr="one"
bttr="two"
fttr="four"
tttr="three"
/>
A bunch of these styles are equivalent for this small input file. They produce the same output:
$ sha256sum * | sort
452f5c19177d9cc6a54589168dbb1ee790c783a963110662e7dfae170bf997e4 out.xml_pp.cvs.xml
452f5c19177d9cc6a54589168dbb1ee790c783a963110662e7dfae170bf997e4 out.xml_pp.indented_a.xml
8e119bb50bcbf3d72159c96139cf328f46a0de259410acdd344f26e52f033996 out.xml_pp.nsgmls.xml
d1ed9a4d1ebf8b9f1d012577809909e91e1ba0fc01b5afc8ff1302ca9dced617 out.xml_pp.record_c.xml
d1ed9a4d1ebf8b9f1d012577809909e91e1ba0fc01b5afc8ff1302ca9dced617 out.xml_pp.record.xml
e0d13f80ddc48876678c62e407abd3ab1eac8481a82d5aabb1514e24aee4717c in.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45 out.xml_pp.indented_close_tag.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45 out.xml_pp.indented_c.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45 out.xml_pp.indented.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45 out.xml_pp.nice.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45 out.xml_pp.none.xml
ea90003eab0ba71936a8a329a87b079b4fb120fe6873d4fa9bc8f986e8654b45 out.xml_pp.wrapped.xml
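If you would rather see the groups directly than compare hashes by eye, something like the following might help. It is only a sketch: group_by_hash is a name made up for this example, and it assumes GNU sha256sum and file names without whitespace.

```shell
# group_by_hash: hypothetical helper, not part of xml_pp or coreutils.
# Prints one line per distinct file content: a shortened hash followed by
# every file that has that content. Assumes file names without whitespace.
group_by_hash() {
  sha256sum "$@" | sort | awk '
    # A new hash starts a new output line.
    $1 != prev { if (NR > 1) printf "\n"; printf "%s:", substr($1, 1, 8); prev = $1 }
    # Append the file name to the current line.
    { printf " %s", $2 }
    END { if (NR > 0) printf "\n" }
  '
}
```

Called as `group_by_hash out.xml_pp.*.xml`, it should print one line per group of equivalent styles.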
None of these styles is exactly what you wanted.
But "cvs" is pretty close. (And "indented_a" produces identical output.)
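If the remaining gap matters (spaces around the "=" and the values lined up), the "cvs" output could be post-processed. Below is a minimal sketch, not a feature of xml_pp. The function name align_attrs is made up for this example, and it assumes that every attribute sits on its own line (as in the "cvs" output) and that attribute values contain no line breaks.

```shell
# align_attrs: hypothetical helper, not part of xml_pp.
# Reads XML with one attribute per line on stdin and pads the attribute
# names so that the " = " signs line up. All other lines pass through.
align_attrs() {
  awk '
    {
      lines[NR] = $0
      # A lone attribute line: optional indent, a name, then ="...
      if (match($0, /^[ \t]*[A-Za-z_:][A-Za-z0-9_.:-]*="/)) {
        eq = RLENGTH - 1                  # position of "=" in the line
        name[NR] = substr($0, 1, eq - 1)  # indent plus attribute name
        rest[NR] = substr($0, eq + 1)     # quoted value and what follows
        if (length(name[NR]) > width) width = length(name[NR])
      }
    }
    END {
      for (i = 1; i <= NR; i++) {
        if (i in name) printf("%-" width "s = %s\n", name[i], rest[i])
        else print lines[i]
      }
    }
  '
}
```

It would be used as `xml_pp -s cvs in.xml | align_attrs`. The padding width is computed over the whole input, so in a large document all elements share one column. The first attribute also stays on its own line instead of following the tag name, so this still does not reproduce the layout from the question exactly.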
Afterthoughts: Output feels a little dirty.
(a) Some of the files just start with a blank line for no good reason...
$ grep '^$' * -n
out.xml_pp.record_c.xml:1:
out.xml_pp.record.xml:1:
(b) ... and some of the files just have no line terminators at all:
$ file *
in.xml: ASCII text
out.xml_pp.cvs.xml: ASCII text
out.xml_pp.indented_a.xml: ASCII text
out.xml_pp.indented_close_tag.xml: ASCII text, with no line terminators
out.xml_pp.indented_c.xml: ASCII text, with no line terminators
out.xml_pp.indented.xml: ASCII text, with no line terminators
out.xml_pp.nice.xml: ASCII text, with no line terminators
out.xml_pp.none.xml: ASCII text, with no line terminators
out.xml_pp.nsgmls.xml: ASCII text
out.xml_pp.record_c.xml: ASCII text
out.xml_pp.record.xml: ASCII text
out.xml_pp.wrapped.xml: ASCII text, with no line terminators
The cause seems to be that xml_pp does not write a newline after the last line of its output. So if the output is only ONE line, the file contains no newline byte at all. Quite weird.
Looks like this:
$ wc --lines *
5 out.xml_pp.cvs.xml
5 out.xml_pp.indented_a.xml
0 out.xml_pp.indented_close_tag.xml
0 out.xml_pp.indented_c.xml
0 out.xml_pp.indented.xml
0 out.xml_pp.nice.xml
0 out.xml_pp.none.xml
5 out.xml_pp.nsgmls.xml
1 out.xml_pp.record_c.xml
1 out.xml_pp.record.xml
0 out.xml_pp.wrapped.xml
17 total
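To check a single file without reading `file` or `wc` output, you can look at its last byte. This is a small sketch; ends_with_newline is a name invented for this example.

```shell
# ends_with_newline: hypothetical helper.
# Succeeds if the file is non-empty and its last byte is LF.
# Command substitution strips a trailing newline, so an empty result
# from "tail -c 1" means the last byte was a newline.
ends_with_newline() {
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}
```

For example, `for f in out.xml_pp.*.xml; do ends_with_newline "$f" || echo "no trailing newline: $f"; done` would name the affected files.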
This is how I like to add a trailing LF (0x0A byte) if none is present:
$ mkdir 1; mv out.*.xml 1/; cp -r 1/ 2/
$ pcregrep -LMr '\n\Z' 2/ | xargs -n1 --no-run-if-empty -- sed -i -e '$a\' --
$ diff --recursive 1/ 2/ | head
diff --recursive 1/out.xml_pp.cvs.xml 2/out.xml_pp.cvs.xml
6c6
< />
\ No newline at end of file
---
> />
diff --recursive 1/out.xml_pp.indented_a.xml 2/out.xml_pp.indented_a.xml
6c6
< />
\ No newline at end of file
Looks like this afterwards:
$ cd 2/
$ wc --lines *
6 out.xml_pp.cvs.xml
6 out.xml_pp.indented_a.xml
1 out.xml_pp.indented_close_tag.xml
1 out.xml_pp.indented_c.xml
1 out.xml_pp.indented.xml
1 out.xml_pp.nice.xml
1 out.xml_pp.none.xml
6 out.xml_pp.nsgmls.xml
2 out.xml_pp.record_c.xml
2 out.xml_pp.record.xml
1 out.xml_pp.wrapped.xml
28 total
Upvotes: 2
Reputation: 157927
xmllint has an option --pretty which supports three levels of prettiness. If this output:
<?xml version="1.0"?>
<a
attr="one"
bttr="two"
tttr="three"
fttr="four"
/>
is ok for you, then use --pretty 2:
xmllint --pretty 2 - <<< '<a attr="one" bttr="two" tttr="three" fttr="four"/>'
Upvotes: 3