Russell C.

Reputation: 1659

Perl Regex Error Help

I'm receiving a similar error in two completely unrelated places in our code that we can't seem to figure out how to resolve. The first error occurs when we try to parse XML using XML::Simple:

Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/local/lib/perl5/XML/LibXML/Error.pm line 217.

And the second is when we try to do simple string substitution:

Malformed UTF-8 character (unexpected non-continuation byte 0x78, immediately after start byte 0xe9) in substitution (s///) at /gold/content/var/www/alltrails.com/cgi-bin/API/Log.pm line 365.

The line in question in our Log.pm file is as follows, where $message is a string:

$message =~ s/\s+$//g;

Our biggest problem in troubleshooting this is that we haven't found a way to identify the input that is causing it. My hope is that someone else has run into this issue before and can provide advice or sample code that will help us resolve it.

Thanks in advance for your help!

Upvotes: 0

Views: 1171

Answers (3)

ysth

Reputation: 98388

Sounds like you have an "XML" file that is expected to contain UTF-8 encoded characters but doesn't. Try opening it and looking for high-bit characters (bytes of 0x80 and above).
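
One way to look for those bytes programmatically is to let Encode try to decode the raw data and report where it gives up. This is only a minimal sketch: first_bad_utf8_offset is a hypothetical helper name, the sample strings are made up, and it assumes you can get at the input as an undecoded byte string before anything else touches it.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns undef when $bytes is well-formed UTF-8,
# otherwise the byte offset of the first malformed sequence.
sub first_bad_utf8_offset {
    my ($bytes) = @_;
    my $rest = $bytes;    # FB_QUIET overwrites its input, so work on a copy
    Encode::decode('UTF-8', $rest, Encode::FB_QUIET());
    # Whatever is left in $rest is the part decode() could not process.
    return length($rest) ? length($bytes) - length($rest) : undef;
}

# Sample inputs (made up for illustration).
for my $sample ("caf\xc3\xa9", "caf\xe9x") {
    my $offset = first_bad_utf8_offset($sample);
    print defined $offset
        ? "malformed at byte offset $offset\n"
        : "well-formed UTF-8\n";
}
```

The second sample reproduces the pattern from the error in the question: a 0xe9 start byte followed directly by 0x78 ("x"), which is what a Latin-1 "é" looks like when it is treated as UTF-8.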

Upvotes: 0

Brad

Reputation: 11505

Can you do a hex dump of the source data to see what it looks like?

If you're reading this from a file, you can do this with a tool like "od".

Or, you can do this inside the Perl script itself by passing the string to a function like this:

# Print a string 16 values per row: decimal, hex, then printable characters.
sub DumpString {
    my @a = unpack('C*',$_[0]);
    my $o = 0;                                  # offset of the current row
    while (@a) {
        my @b = splice @a,0,16;                 # next 16 values
        my @d = map sprintf("%03d",$_), @b;     # decimal
        my @x = map sprintf("%02x",$_), @b;     # hex
        my $c = substr($_[0],$o,16);
        $c =~ s/[[:^print:]]/ /g;               # blank out non-printables
        printf "%6d %s\n",$o,join(' ',@d);
        print " "x8,join('  ',@x),"\n";
        print " "x9,join('   ',split(//,$c)),"\n";
        $o += 16;
    }
}

Upvotes: 1

zigdon

Reputation: 15063

Not sure what the cause is, but if you want to log the message that is causing this, you could always add a __DIE__ signal handler to make sure you capture the error:

# Assumes $message is visible in the scope where the handler is defined.
$SIG{__DIE__} = sub {
  if ($_[0] =~ /Malformed UTF-8 character/) {
    print STDERR "message = $message\n";
  }
};

That should at least let you know what string is triggering these errors.
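
One caveat: as far as I know, "Malformed UTF-8 character" is normally emitted as a warning rather than a fatal error, in which case a __DIE__ handler would not fire. Below is a rough sketch of the same idea using __WARN__ instead. The names $current_message, @utf8_reports and hex_bytes are hypothetical, chosen only for illustration. It records the bytes as hex, since printing a malformed string directly can be misleading.

```perl
use strict;
use warnings;
use Encode ();

our $current_message = '';   # hypothetical: set just before the failing s///
our @utf8_reports;           # hypothetical sink; swap in your own logger

# Raw internal bytes as hex, ignoring the UTF-8 flag, so the output is
# safe to print even when the string itself is malformed.
sub hex_bytes {
    my ($s) = @_;                     # $s is a copy of the argument
    Encode::_utf8_off($s);            # clear the flag on the copy only
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $s;
}

$SIG{__WARN__} = sub {
    my ($warning) = @_;
    if ($warning =~ /Malformed UTF-8 character/) {
        push @utf8_reports, "bytes: " . hex_bytes($current_message);
    }
    print STDERR $warning;            # still emit the original warning
};
```

In Log.pm you would assign $current_message = $message; right before the substitution, then inspect @utf8_reports (or log from inside the handler) to see which input triggered the warning.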

Upvotes: 3
