sekar
sekar

Reputation:

How to detect malformed UTF characters

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?

Upvotes: 12

Views: 2323

Answers (4)

Ale
Ale

Reputation: 997

Yesterday, instead of the usual output, I got a cron message saying:

/etc/cron.daily/syslogrotate:
Malformed UTF-8 character (fatal) at /root/bin/maillogstats.pl line 55, <> line 12335.

The log file contained a line rendered in less -N like so:

 12335 Apr 29 14:07:46 22 north courieresmtpd: error,relay=***.***.125.226,port=41522,msg="502 ESMTP command error",cmd: ....<A7>.

The script started with these statements:

use utf8;
use open qw(:std :utf8);

I changed it according to a suggestion found in Wikibooks

use utf8;
use open qw(:std :encoding(UTF-8));

And that's it.

Upvotes: 0

Constantin
Constantin

Reputation: 28154

Consider Python. It allows to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.

import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')

This prints:

spam  eggs bacon

Upvotes: 2

Mike G.
Mike G.

Reputation: 1690

RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).

But BKB is probably correct - the easiest answer is to let perl do it for you, although I'm not sure what Perl does when it detects the incorrect utf-8 with that filter in effect.

Upvotes: 1

Jon Skeet
Jon Skeet

Reputation: 1499800

EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)

One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.

Upvotes: 1

Related Questions