Jp Morgan

Reputation: 181

Bash: format list elements in HTML

I have no bash experience and just want to know how to get started.

I have to write a bash script that properly formats an XHTML document. For example, it should turn this:

   <p>Test</p><ol><li>Test
    </li><li>
    Test</li></ol>

into this:

<p>Test</p>
<ol>
  <li>Test</li>
  <li>Test</li>
</ol>

Now I believe I have to do something like:

cat > format1 #create file
#!/bin/bash
if tail of a line ends with "</A-a>": (like </li> or </ol> or </p> or </ul>)
    add \n
fi

if head of a line = <ol> or <ul>
    add \n
fi

Please help me understand it. This is all I can think of and I really would like to know how to solve it.

Upvotes: 10

Views: 395

Answers (5)

ddoxey

Reputation: 2063

Given the constraints that the problem must be solved with a bash script and that you cannot use htmltidy, I'd get started by creating a file xhtmltidy.sh which contains:

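#!/bin/bash
# Needs GNU sed: \s, \+ and \n in the replacement are GNU extensions.

# echo $( cat ) joins all of stdin into one line (unquoted on purpose).
echo $( cat )                       |\
    sed 's/\s*\(<[^>]\+>\)\s*/\1/g' |\
    sed 's/></>\n</g'               |\
    awk '{
        # closing tag alone on the line: dedent before printing
        if ( $0 ~ /^<\/[^>]+>$/ ) indent=substr(indent,3);
        print indent$0;
        # opening tag alone on the line: indent the lines that follow
        if ( $0 ~ /^<[^\/>][^>]*>$/ ) indent=indent"  ";
    }'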

To use this program you'll pipe the content into it like this:

cat input.html | ./xhtmltidy.sh

This will at least do the trick given the sample input that you provided.

Some explanation:

  • echo $( cat ) captures all of stdin as a single line of text
  • the first sed strips the whitespace around each XHTML tag
  • the second sed puts a newline between adjacent XHTML tags
  • awk reduces the indent if a line is a lone closing XHTML tag (such as </ol>)
  • awk prints the line with the current indent
  • awk increases the indent if a line is a lone opening XHTML tag (such as <ol>)

This toy program will break as soon as the input gets more complex than the sample. But it should give you some idea why it's better to use an off-the-shelf utility than to write your own.
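To make the steps above easier to try out, here is a self-contained sketch of the same idea. It is a variant of the pipeline, not a replacement for a real formatter: it uses tr and portable sed syntax instead of echo $( cat ) and the GNU-only escapes, and it indents by two spaces. The function name format_xhtml and the variable sample are names made up for this example.

```shell
#!/bin/bash
# Sketch only: format_xhtml is a hypothetical helper name.
# Step 1 (tr):  squeeze every run of whitespace, newlines included, into one space.
# Step 2 (sed): drop spaces around tags, then break the line between adjacent tags.
# Step 3 (awk): dedent on a lone closing tag, print, indent after a lone opening tag.
format_xhtml() {
    tr -s '[:space:]' ' ' |
    sed -e 's/ *\(<[^>]*>\) */\1/g' -e 's/></>\
</g' |
    awk '
        /^<\/[^>]+>$/     { indent = substr(indent, 3) }  # lone closing tag
                          { print indent $0 }
        /^<[^\/>][^>]*>$/ { indent = indent "  " }        # lone opening tag
    '
}

# The sample input from the question.
sample='   <p>Test</p><ol><li>Test
    </li><li>
    Test</li></ol>'

printf '%s\n' "$sample" | format_xhtml
```

On the sample from the question this prints the five lines the question asks for, with each <li> indented by two spaces under <ol>. It is still a toy: a self-closing tag such as <br/> sitting on its own line is counted as an opening tag, so everything after it stays indented.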

Upvotes: 2

Peter Faller

Reputation: 132

Another alternative to look into is xmllint, which may be installed on your system:

xmllint --format <input-file>
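One thing to watch for: xmllint expects well-formed XML with a single root element, and the fragment in the question has two top-level elements. A minimal sketch, assuming you are free to wrap the fragment in a root element (the <div> wrapper and the variable names fragment and formatted are choices made for this example):

```shell
#!/bin/bash
# Sketch: the <div> wrapper is an assumption, added only so that the
# fragment has a single root element and parses as XML.
fragment='<p>Test</p><ol><li>Test</li><li>Test</li></ol>'

if command -v xmllint >/dev/null 2>&1; then
    # "-" tells xmllint to read the document from standard input.
    formatted=$(printf '<div>%s</div>\n' "$fragment" | xmllint --format -)
    printf '%s\n' "$formatted"
else
    echo "xmllint is not installed" >&2
fi
```

As far as I know, xmllint only re-indents between elements and leaves the text inside them alone, so stray line breaks inside an <li> (as in the question's sample) would still need to be cleaned up separately.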

Upvotes: 0

jadeallencook

Reputation: 686

HTML Tidy may already be installed on your system. It was on mine, and I don't remember ever installing it. You can check by running:

man tidy

If you get the manual, then you're ready to rock and roll! Put the options and the output file before the input file:

tidy [options] -output newFile.xhtml oldFile.xhtml

Upvotes: 0

rjv

Reputation: 6776

Use html-tidy. If you plan to use tidy regularly, it would be a good idea to add this to your .bashrc:

alias tidy="tidy -xml --indent auto --indent-spaces 1 --quiet yes -im"

The above command creates an alias for tidy that treats the input as well-formed XML (-xml, so every tag is expected to be closed), indents with a single space per level, suppresses non-essential output (--quiet yes), and modifies the file in place (-m).

Upvotes: 1

gvlasov

Reputation: 20035

I suggest you look at the html-tidy utility.

You don't have to write a formatter yourself; there are a lot of existing utilities that do that for you. Besides, it is not a trivial task, and "how to implement an HTML pretty-print formatter" would be a really broad question to ask (broad questions are against StackOverflow rules).

Upvotes: 0
