Reputation:
I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Upvotes: 12
Views: 2323
Reputation: 997
Yesterday, instead of the usual output, I got a cron message saying:
/etc/cron.daily/syslogrotate:
Malformed UTF-8 character (fatal) at /root/bin/maillogstats.pl line 55, <> line 12335.
The log file contained a line rendered in less -N like so:
12335 Apr 29 14:07:46 22 north courieresmtpd: error,relay=***.***.125.226,port=41522,msg="502 ESMTP command error",cmd: ....<A7>.
The script started with these statements:
use utf8;
use open qw(:std :utf8);
I changed it according to a suggestion found in Wikibooks
use utf8;
use open qw(:std :encoding(UTF-8));
And that's it.
Upvotes: 0
Reputation: 28154
Consider Python. It allows to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')
This prints:
spam eggs bacon
Upvotes: 2
Reputation: 1690
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
But BKB is probably correct - the easiest answer is to let perl do it for you, although I'm not sure what Perl does when it detects the incorrect utf-8 with that filter in effect.
Upvotes: 1
Reputation: 1499800
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
Upvotes: 1