chronos

Reputation: 11

Remove invalid UNICODE characters from XML file in UNIX?

I have a shell script that I use to remotely clean an XML file produced by another system that contains invalid UNICODE characters. I am currently using this command in the script to remove the invalid characters:

perl -CSDA -i -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml

This has worked so far, but the file now contains a new kind of error which, as far as I can tell, is the byte 0xA0. When my perl command reaches that byte, it erases the rest of the file. I modified my command to include \xA0, but it doesn't work:

perl -CSDA -i -pe's/[^\x9\xA0\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml

I have also tried using:

iconv -f UTF-8 -t UTF-8 -c file.xml > file2.xml

but that doesn't do anything. It produces an identical file with the same errors.

Is there a unix command that I can use that will completely remove all invalid UNICODE characters?

EDIT: some HEX output (note the 1A's and A0's):

3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70

6D 62 65 72 3E A0 39 34 32 39 38 3C 2F

Upvotes: 1

Views: 1158

Answers (2)

ikegami

Reputation: 385764

A0 on its own is not a valid UTF-8 sequence. The errors you were encountering before were XML encoding errors, while this one is a character encoding error.

A0 is the Unicode Code Point for a non-breaking space. It is also the iso-8859-1 and cp1252 encoding of that Code Point.

I would recommend fixing the problem at its source. But if that's not possible, I would recommend using Encoding::FixLatin to fix this new type of error (perhaps via the bundled fix_latin script). It will correctly replace A0 with C2 A0 (the UTF-8 encoding of a non-breaking space).

Combined with your existing script:

perl -i -MEncoding::FixLatin=fix_latin -0777pe'
   $_ = fix_latin($_);
   utf8::decode($_);
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
   utf8::encode($_);
' file.xml

Upvotes: 1

Mons Anderson

Reputation: 421

You can use the following one-liner:

perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{""}))' file.xml

You can also extend it to warn about each bad byte it drops:

perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{warn "Bad byte: @_";""}))' file.xml

Upvotes: 1
