Reputation: 18046

pretty-printing malformed xml

I am working on a data migration and I am parsing and exporting html into xml. The html gets escaped, of course, when it goes into the xml, but to verify that parsing is happening properly, I am decoding the brackets to get readable html tags inside the xml. However, the tags are all run-together, and it's still not very readable.

Is there something that can simply indent the tag structure that I have? It's neither valid xml nor html. I've tried xmllint --format and xmllint --htmlout, but both of those choke at different points.

Can I avoid doing this by hand?

Here is a small example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<result><node><title>This would be the title</title><uri>/path/filename.jpg</uri><alt>Alt tag data</alt><body><p>Some text goes here.</body></node></result>

In the actual data, the html tags inside <body> are all escaped to &lt; and &gt;, but that was too difficult to eyeball to see if the parsing worked correctly. So I changed them to their bracket equivalents with a find and replace. But they are still not indented, so it is difficult to read.
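For illustration, the find and replace amounts to something like the following Python sketch. The function name reveal_brackets is made up for this example; only the two bracket entities are touched, and everything else is left as it is.

```python
def reveal_brackets(escaped):
    """Turn only the escaped angle brackets back into literal ones."""
    # Other entities such as &amp; are deliberately left untouched.
    return escaped.replace("&lt;", "<").replace("&gt;", ">")
```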

Both tidy and xmllint complain about the missing closing </p> tag. In this data, there are a number of missing or mismatched tags. I understand that this is not valid html or xml, but we'll do the cleanup of the html later. At this point I just have to make sure that the html is getting parsed at the right places, which is difficult to see when there are no line breaks or indentation.

To fix the above example, I could remove or close the <p> tag manually, but in the actual data, there is a lot of brokenness, and it would be a non-trivial task to fix tags just to get it to parse for formatting. At this phase I am trying to avoid manual massaging and do things in an automated manner.

For example, for this one file, tidy reports 65 warnings and 778 errors. Fixing them all by hand would be a waste of time -- I might as well start indenting myself. I need something that can indent in a non-strict manner, and is not going to care about unmatched tags.

Upvotes: 1

Answers (3)

Jacob Oscarson

Reputation: 6393

I had this problem recently too and wrote my own in Python 3 using BeautifulSoup (v4+), with some extra wrapping of long lines provided by textwrap.wrap():

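   import sys
   from textwrap import wrap

   from bs4 import BeautifulSoup

   path = sys.argv[1]  # file to pretty-print

   with open(path) as fp:
       # Name the parser explicitly; bs4 warns when it has to guess one.
       soup = BeautifulSoup(fp, 'html.parser')

   # Wrap long lines of the prettified markup, keeping its newlines.
   for line in wrap(soup.prettify(), replace_whitespace=False):
       print(line)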

BeautifulSoup does a good job of permissively interpreting most tag-based junk you throw at it. No indentation of tags with this solution, though.
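One possible way to keep indentation is to wrap each line on its own instead of wrapping the whole text at once. This is only a sketch using the standard library; the helper name wrap_keeping_indent is hypothetical, and it assumes the input is already-indented text such as the string returned by prettify().

```python
from textwrap import wrap


def wrap_keeping_indent(text, width=70):
    """Wrap each line separately so its leading indentation survives."""
    wrapped = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if not stripped:
            wrapped.append("")  # keep blank lines as they are
            continue
        # Whatever whitespace preceded the content is reused as the
        # prefix for every wrapped piece of this line.
        prefix = line[: len(line) - len(stripped)]
        wrapped.extend(
            wrap(stripped, width=width,
                 initial_indent=prefix, subsequent_indent=prefix)
        )
    return "\n".join(wrapped)
```

It would be called on the prettified string, for example print(wrap_keeping_indent(soup.prettify())) if the parsed document is held in a variable named soup.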

Upvotes: 1

user151841

Reputation: 18046

I used the formatting function that user Josh Leitzel posted here. Not perfect, but good enough.
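For anyone who wants the same general idea in Python, here is a minimal sketch of a non-strict indenter. It is not Josh Leitzel's function; the name lenient_indent and the token regex are my own assumptions. It splits the input into tags and text, indents by a running depth counter, and never checks whether tags match.

```python
import re

# Tags (<...>), the text runs between them, or a stray "<" on its own.
TOKEN = re.compile(r"<[^>]+>|[^<]+|<")


def lenient_indent(markup, indent="  "):
    """Indent tag-like markup by nesting depth without validating it."""
    depth = 0
    lines = []
    for token in TOKEN.findall(markup):
        token = token.strip()
        if not token:
            continue  # skip whitespace-only text between tags
        is_tag = token.startswith("<") and token.endswith(">")
        if is_tag and token.startswith("</"):
            # Closing tag: step back one level, but never below zero,
            # so unmatched closing tags cannot break the output.
            depth = max(depth - 1, 0)
            lines.append(indent * depth + token)
        elif (is_tag and not token.startswith(("<?", "<!"))
              and not token.endswith("/>")):
            # Opening tag: print at the current level, then go deeper.
            lines.append(indent * depth + token)
            depth += 1
        else:
            # Text, declarations, comments, self-closing tags, stray "<":
            # printed at the current level with no change in depth.
            lines.append(indent * depth + token)
    return "\n".join(lines)
```

Because nothing is validated, an unclosed tag such as the <p> in the question simply leaves everything after it one level deeper than expected, which also makes the broken spots easy to see. Unclosed void HTML tags like <br> cause the same drift in this sketch.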

Upvotes: 1

Gilles Quénot

Reputation: 185073

You should try tidy:

$ tidy -h
tidy [option...] [file...] [option...] [file...]
Utility to clean up and pretty print HTML/XHTML/XML

See http://tidy.sourceforge.net/

Edit

Your problem is just the unclosed <p> tag. If you remove it, xmllint formats the file:

$ xmllint --format file.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<result>
  <node>
    <title>This would be the title</title>
    <uri>/path/filename.jpg</uri>
    <alt>Alt tag data</alt>
    <body>Some text goes here.</body>
  </node>
</result>

No error.

Edit 2

My thought is to use a tool like html2text so that the text you feed into the xml has no html tags, and maybe you can keep the indented HTML itself inside XML CDATA sections.

Upvotes: 1
