Reputation: 18046
I am working on a data migration and I am parsing and exporting html into xml. The html gets escaped, of course, when it goes into the xml, but to verify that parsing is happening properly, I am decoding the brackets to get readable html tags inside the xml. However, the tags are all run-together, and it's still not very readable.
Is there something that can simply indent the tag structure that I have? It's neither valid xml nor html. I've tried xmllint --format and xmllint --htmlout, but both of those choke at different points.
Can I avoid doing this by hand?
Here is a small example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<result><node><title>This would be the title</title><uri>/path/filename.jpg</uri><alt>Alt tag data</alt><body><p>Some text goes here.</body></node></result>
In the actual data, the html tags inside <body> are all escaped to &lt; and &gt;, but that was too difficult to eyeball to see whether the parsing worked correctly. So I changed them to their bracket equivalents with a find and replace. But they are still not indented, so it is difficult to read.
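For anyone who wants to script that find and replace, here is a minimal sketch in Python. The function name unescape_brackets is made up for illustration, and it assumes only &lt; and &gt; need converting, so other entities such as &amp; are left untouched.

```python
def unescape_brackets(text):
    """Turn &lt; and &gt; back into literal angle brackets; leave other entities alone."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


if __name__ == "__main__":
    print(unescape_brackets("&lt;p&gt;Some text goes here.&lt;/body&gt;"))
```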
Both tidy and xmllint complain about the missing closing </p> tag. In this data, there are a number of missing or mismatched tags. I understand that this is not valid html or xml, but we'll clean up the html later; at this point I just have to make sure that the html is getting parsed at the right places, which is difficult to see when there are no line breaks or indentation.
To fix the above example, I could remove or close the <p> tag manually, but in the actual data there is a lot of brokenness, and it would be a non-trivial task to fix tags just to get it to parse for formatting. At this phase I am trying to avoid manual massaging and do things in an automated manner.
For example, for this one file, tidy reports 65 warnings and 778 errors. Fixing them all by hand would be a waste of time -- I might as well start indenting myself. I need something that can indent in a non-strict manner, and is not going to care about unmatched tags.
Upvotes: 1
Views: 1096
Reputation: 6393
I had this problem too recently and wrote my own in Python (3) using BeautifulSoup (v4+), with some extra wrapping of long lines provided by textwrap.wrap():
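import sys
from textwrap import wrap
from bs4 import BeautifulSoup

# Path of the file to pretty-print, taken from the first command-line argument.
path = sys.argv[1]
with open(path) as fp:
    # Naming the parser explicitly avoids bs4's "no parser specified" warning;
    # html.parser is lenient about unclosed or mismatched tags.
    pretty = BeautifulSoup(fp, "html.parser").prettify()
# replace_whitespace=False keeps the newlines that prettify() produced.
for line in wrap(pretty, replace_whitespace=False):
    print(line)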
BeautifulSoup does a good job of leniently interpreting most tag-based junk you throw at it. No indentation of tags with this solution, though.
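If the lost indentation matters, one possible variation is to wrap each line separately and repeat its leading whitespace on the continuation lines. This is only a sketch: wrap_indented is a made-up helper name, and it assumes its input is text that has already been pretty-printed (for example the string returned by prettify()).

```python
from textwrap import wrap


def wrap_indented(pretty, width=80):
    """Wrap each line on its own, repeating its leading whitespace on continuation lines."""
    out = []
    for line in pretty.splitlines():
        stripped = line.lstrip()
        if not stripped:
            # Keep blank lines as they are.
            out.append("")
            continue
        # Whatever whitespace preceded the content is reused as the indent.
        prefix = line[: len(line) - len(stripped)]
        out.extend(
            wrap(stripped, width=width, initial_indent=prefix, subsequent_indent=prefix)
        )
    return "\n".join(out)


if __name__ == "__main__":
    sample = "<a>\n <b>\n  one two three four\n </b>\n</a>"
    print(wrap_indented(sample, width=12))
```

In the script above, this would mean printing wrap_indented(BeautifulSoup(fp, "html.parser").prettify()) instead of looping over wrap().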
Upvotes: 1
Reputation: 18046
I used the formatting function that user Josh Leitzel posted here. Not perfect, but good enough.
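The linked function is not reproduced here. For readers who just want the general idea, below is a rough sketch of a non-strict indenter in Python. It is not Josh Leitzel's code; the name loose_indent, the regexes, and the list of void tags are my own assumptions. It puts every tag and text run on its own line and tracks depth naively, without checking that tags match.

```python
import re

# HTML tags that never take a closing tag, so they should not increase depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

TOKEN_RE = re.compile(r"(<[^>]+>)")                # splits markup into tags and text
TAG_NAME_RE = re.compile(r"</?\s*([A-Za-z][\w:.-]*)")


def loose_indent(markup, indent="  "):
    """Indent tag soup one token per line without validating it."""
    depth = 0
    lines = []
    for token in TOKEN_RE.split(markup):
        token = token.strip()
        if not token:
            continue
        if token.startswith("</"):
            # Closing tag: step back out, never below zero on stray closers.
            depth = max(depth - 1, 0)
            lines.append(indent * depth + token)
        elif token.startswith("<") and token.endswith(">"):
            lines.append(indent * depth + token)
            name = TAG_NAME_RE.match(token)  # None for <?xml ...?> and <!-- ... -->
            if (name is not None
                    and not token.endswith("/>")
                    and name.group(1).lower() not in VOID_TAGS):
                depth += 1
        else:
            # Plain text between tags.
            lines.append(indent * depth + token)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ("<result><node><title>This would be the title</title>"
              "<body><p>Some text goes here.</body></node></result>")
    print(loose_indent(sample))
```

Because nothing is validated, an unclosed tag such as the <p> in the question simply leaves everything after it one level too deep. The output stays readable, and the drift itself shows where a tag is missing.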
Upvotes: 1
Reputation: 185073
You should try tidy:
$ tidy -h
tidy [option...] [file...] [option...] [file...]
Utility to clean up and pretty print HTML/XHTML/XML
See http://tidy.sourceforge.net/
Your problem is just the unclosed <p> tag; you should remove it:
$ xmllint --format file.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<result>
  <node>
    <title>This would be the title</title>
    <uri>/path/filename.jpg</uri>
    <alt>Alt tag data</alt>
    <body>Some text goes here.</body>
  </node>
</result>
No error.
My thought is to use a tool like html2text to feed the xml with no html tags, and maybe you can store the indentation of HTML files in XML CTAGS.
Upvotes: 1