How to remove unused HTML character references from a file using Unix

Question

We have an HTML source which is processed by an Informatica workflow. In between these two we have a Unix script which transforms the file.

For the past week we have been getting an "invalid format" error in Informatica, because the file contains unused HTML character references (0-8, 14-31, etc.).

example:

&#00; - &#08; Unused
&#11; - &#12; Unused
&#14; - &#31; Unused
&#127; - &#159; Unused

We need to handle it in Unix and remove the above mentioned characters from the HTML file before processing it.

I have tried a sed command like:

sed -e 's/\&$[^\amp;|^\apos;|^\quot;|^\lt;|^\gt;]$/\&\1/g'

but it does not serve the purpose. Also, since we have so many unused references, they cannot be hardcoded.

Could you please let me know how to proceed with this?

svante · Accepted Answer

Here is a working bash solution that treats the encoded characters as strings. It is unclear whether your source contains the references in encoded form, but this works if it does:

sed 's/'`for n in {00..08} {11..12} {14..31} {127..159}; do echo -n "&#"$n";\|"; done`'//g'
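The command above only matches the exact spellings produced by the brace expansion (for example &#00; but not &#0;). If the file may also contain unpadded or differently padded references, one possible variant is a single extended regular expression covering the same numeric ranges. This is only a sketch: the function name strip_unused_refs is made up for illustration, and it assumes the references appear in decimal form and that your sed supports -E (GNU and BSD sed do).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads HTML on stdin, writes cleaned HTML to stdout.
# Removes decimal references 0-8, 11-12, 14-31 and 127-159, with or
# without leading zeros. Hex references (&#x..;) are not handled.
strip_unused_refs() {
  sed -E 's/&#0*([0-8]|1[124-9]|2[0-9]|3[01]|12[7-9]|1[3-5][0-9]);//g'
}

# Example: unused references are dropped; &#10; (line feed) and &amp; stay.
printf '%s\n' 'a&#00;b&#8;c&#10;d&#031;e&#159;f&amp;g' | strip_unused_refs
# prints: abc&#10;def&amp;g
```

In the transformation script it could be used as a filter, for example strip_unused_refs < input.html > cleaned.html, where both file names are placeholders.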
