Reputation: 11
I have a shell script that I use to remotely clean an XML file produced by another system that contains invalid UNICODE characters. I am currently using this command in the script to remove the invalid characters:
perl -CSDA -i -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
This has worked so far, but the file now contains a new kind of error, which as far as I can tell is the byte 0xA0. When my perl command reaches that byte, it erases the rest of the file. I modified the command to include \xA0, but that doesn't work:
perl -CSDA -i -pe's/[^\x9\xA0\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
I have also tried using:
iconv -f UTF-8 -t UTF-8 -c file.xml > file2.xml
but that doesn't do anything. It produces an identical file with the same errors.
Is there a unix command that I can use that will completely remove all invalid UNICODE characters?
EDIT: some HEX output (note the 1A's and A0's):
3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70
6D 62 65 72 3E A0 39 34 32 39 38 3C 2F
Upvotes: 1
Views: 1158
Reputation: 385764
A0 is not a valid UTF-8 sequence. The errors you were encountering before were XML encoding errors (characters that are validly encoded but not allowed in XML, such as 1A), while this one is a character encoding error.
A0 is the Unicode code point of a non-breaking space (U+00A0). It is also the iso-8859-1 and cp1252 encoding of that code point.
I would recommend fixing the problem at its source. If that's not possible, I would recommend using Encoding::FixLatin to fix this new type of error (perhaps via the bundled fix_latin script). It will correctly replace A0 with C2 A0 (the UTF-8 encoding of a non-breaking space).
Combined with your existing script:
perl -i -MEncoding::FixLatin=fix_latin -0777pe'
$_ = fix_latin($_);
utf8::decode($_);
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
utf8::encode($_);
' file.xml
Upvotes: 1
Reputation: 421
You can use the following one-liner, which drops every byte that is not part of a valid UTF-8 sequence:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{""}))' file.xml
You can also extend it to warn about each byte it drops:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{warn "Bad byte: @_";""}))' file.xml
Upvotes: 1