Reputation: 181
I have no bash experience and just want to know how to get started.
I have to write a bash script that properly formats an XHTML document. For example, it should turn this:
<p>Test</p><ol><li>Test
</li><li>
Test</li></ol>
into this:
<p>Test</p>
<ol>
<li>Test</li>
<li>Test</li>
</ol>
Now I believe I have to do something like:
cat > format1 #create file
#!/bin/bash
if tail of a line ends with "</A-a>": (like </li> or </ol> or </p> or </ul>)
add \n
fi
if head of a line = <ol> or <ul>
add \n
fi
Please help me understand it. This is all I can think of and I really would like to know how to solve it.
Upvotes: 10
Views: 395
Reputation: 2063
Given the constraints that the problem must be solved with a bash script and that you cannot use htmltidy, I'd get started by creating a file xhtmltidy.sh which contains:
#!/bin/bash
echo $( cat ) |\
sed 's/\s*\(<[^>]\+>\)\s*/\1/g' |\
sed 's/></>\n</g' |\
awk '{
if ( $0 ~ /^<\/[^>]+>$/ ) indent=substr(indent,2);
print indent$0;
if ( $0 ~ /^<[^\/>][^>]*>$/ ) indent=indent" ";
}'
To use this program you'll pipe the content into it like this:
cat sexist.html | ./xhtmltidy.sh
This will at least do the trick given the sample input that you provided.
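Some explanation:
echo $( cat ) reads all of standard input and, because the substitution is unquoted, collapses every run of whitespace (including newlines) into a single space.
The first sed removes any whitespace on either side of each tag.
The second sed starts a new line between adjacent tags, so every >< becomes >, newline, <.
The awk script prints each line with the current indent. It removes one space of indent before a line that holds only a closing tag, and adds one space after a line that holds a single opening tag such as <ol>.
The \s, \+ and \n escapes in the sed expressions are GNU sed extensions, so other sed implementations may behave differently.
This toy program will break as soon as the input gets more complex. But that will give you some idea why it's better to use an off-the-shelf utility rather than write your own.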
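For comparison, the rule sketched in the question (a newline after every closing tag and after <ol> or <ul>) can be written without any indentation logic. This is only a minimal sketch, not a general formatter: format_xhtml is a hypothetical function name, and, like the script above, it assumes that whitespace next to a tag is never significant and that > never appears inside an attribute value.

```shell
#!/bin/sh
# format_xhtml: hypothetical helper; reads XHTML on stdin, writes to stdout.
format_xhtml() {
    # Join all input lines into one long line (echo adds the final newline).
    { tr '\n' ' '; echo; } |
    # Drop whitespace immediately before "<" and immediately after ">".
    sed -e 's/[[:space:]]*</</g' -e 's/>[[:space:]]*/>/g' |
    # Append a newline after each closing tag and after <ol> or <ul>.
    awk '{ gsub(/<\/[^>]+>|<[ou]l>/, "&\n"); printf "%s", $0 }'
}
```

With a final line that calls format_xhtml, the script would be run as ./format1 < input.xhtml (input.xhtml being a placeholder name). On the sample from the question it should produce the five lines asked for there, with no indentation.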
Upvotes: 2
Reputation: 132
Another alternative to look into is xmllint, which may be installed on your system:
xmllint --format <input-file>
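As a rough sketch of how that could be wrapped in a script, the function below (pretty_xml is a made-up name) checks that xmllint is available before calling it. Two things to keep in mind: xmllint expects a well-formed XML document with a single root element, so a fragment like the one in the question would need to be wrapped in one first, and passing - as the file name makes it read standard input.

```shell
#!/bin/sh
# pretty_xml: hypothetical wrapper around `xmllint --format`.
# Usage: pretty_xml FILE   (use - to read standard input)
pretty_xml() {
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "xmllint not found (it is part of libxml2)" >&2
        return 127
    fi
    # --format re-indents the parsed document and prints it to stdout.
    xmllint --format "$1"
}
```

For example, pretty_xml page.xhtml > page.formatted.xhtml, where both file names are placeholders.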
Upvotes: 0
Reputation: 686
HTML Tidy may already be installed on your system; it was on mine, and I don't remember ever installing it. You can check by running:
man tidy
If you get the manual, then you're ready to rock and roll! Put the options before the input file:
tidy [options] -output newFile.xhtml oldFile.xhtml
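If you would rather test for tidy from inside a script than read the man page, command -v does the job. A small sketch (have_tool is just an illustrative name):

```shell
#!/bin/sh
# have_tool: hypothetical helper; succeeds if the shell can find the command.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool tidy; then
    echo "tidy is installed"
else
    echo "tidy is not installed"
fi
```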
Upvotes: 0
Reputation: 6776
Use html-tidy. It would be a good idea to add this to your .bashrc if you wish to use tidy:
alias tidy="tidy -xml --indent auto --indent-spaces 1 --quiet yes -im"
The above command creates an alias for tidy that treats the file as XML (so every tag is expected to have a closing tag), indents with a single space, suppresses non-essential output, and modifies the file in place.
Upvotes: 1
Reputation: 20035
I suggest you look at the html-tidy utility.
You don't have to write a formatter yourself; there are a lot of existing utilities that do that for you. Besides, it is not a trivial task, and "how to implement an HTML pretty-print formatter" would be a really broad question to ask (broad questions are against StackOverflow rules).
Upvotes: 0