Reputation: 143
We have an HTML source file that is processed by an Informatica workflow. Between the two, a Unix script transforms the file.
For the past week, Informatica has been failing with an "invalid format" error because the file contains unused HTML character references (0-8, 14-31, etc.).
example:
� -  Unused
 -  Unused
 -  Unused
 - Ÿ Unused
We need to handle this in the Unix script and remove the above-mentioned characters from the HTML file before it is processed.
I tried a sed command like this:
sed -e 's/\&\([^\amp;|^\apos;|^\quot;|^\lt;|^\gt;]\)/\&\1/g'
but it does not do the job. Also, because there are so many unused references, they cannot be hardcoded one by one.
Could you please let me know how to proceed?
Upvotes: 0
Views: 79
Reputation: 1385
Here is a working (bash) solution that treats the encoded characters as strings. It is unclear whether your source is encoded, but this works if it is:
sed 's/'`for n in {00..08} {11..12} {14..31} {127..159}; do echo -n "&#"$n";\|"; done`'//g'
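A possible variant of the same idea, as a sketch only: instead of generating every alternative with a loop, the same ranges (0-8, 11-12, 14-31, 127-159) can be written as one extended regular expression. This assumes the references are in decimal form, possibly with leading zeros (`&#00;` or `&#0;`). Hexadecimal references such as `&#x1F;` are not handled. The function name `strip_control_refs` is made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: strip decimal numeric character references for the code points
# 0-8, 11-12, 14-31 and 127-159, with optional leading zeros.
# strip_control_refs is a hypothetical helper name, not part of any tool.
strip_control_refs() {
  # [0-8]        -> 0-8
  # 1[124-9]     -> 11, 12, 14-19
  # 2[0-9]|3[01] -> 20-31
  # 12[7-9]|1[34][0-9]|15[0-9] -> 127-159
  sed -E 's/&#0*([0-8]|1[124-9]|2[0-9]|3[01]|12[7-9]|1[34][0-9]|15[0-9]);//g' "$@"
}

# Usage: named references and valid numeric references are left alone.
printf '%s\n' 'a&#00;b&#7;c&amp;d&#159;e&#65;f&#10;g' | strip_control_refs
# prints: abc&amp;de&#65;f&#10;g
```

Called with a file name (`strip_control_refs input.html > cleaned.html`, both names being placeholders), it reads the file instead of standard input. Tab (9), line feed (10) and carriage return (13) are deliberately kept, matching the ranges in the loop above.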
Upvotes: 1