Perl: Repair utf8 xml file which contains octal or hexadecimal codes

Question

I received an XML file from a Linux server on a Windows 10 machine. The file was base64-encoded. I decoded it with a Perl script using the function decode_base64 from MIME::Base64. I then checked with a Perl script whether it is well-formed, but it is not:

C:\test>perl test_well_formed.pl test.xml
test.xml:3: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xFC 0x6C 0x6C 0x65
M³ller
        ^

I looked at the content. Notepad++ displays the umlaut ü as hexadecimal code



MxFCller

and Emacs displays ü as octal code:



M\374ller

The encoding in Emacs was:

 Its value is ‘utf-8-dos’

Obviously, hexadecimal and octal codes are not allowed in UTF-8 XML.

What I want is:



Müller

My main question is: How can I repair the xml-files?

One solution could be to read the file with a Perl script, line by line or slurped, and replace the hexadecimal code (or octal code?) with the umlaut. Or is there a better way to repair it? For example, could the umlauts be handled when the base64 file is decoded?

A second question is: Why does one editor display octal codes and the other hexadecimal codes?


ikegami · Accepted Answer

You don't have "hex codes" or "octal codes". That's how Notepad++ and Emacs display invalid bytes in the file.
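To see why the editors flag that byte, here is a minimal sketch that takes the bytes from the parser error and tries to decode them two ways. The sample string is made up from the error message, and Windows-1252 is only an assumed candidate encoding.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

# The bytes from the parser error: "M", 0xFC, "ller".
my $bytes = "M\xFCller";

# Strict UTF-8 decoding dies on malformed input. decode() may modify its
# input when a CHECK value is given, so it is handed a copy.
my $utf8_ok = eval { decode( "UTF-8", my $copy = $bytes, FB_CROAK ); 1 } ? 1 : 0;

# The same bytes decode cleanly as Windows-1252 (an assumption about the
# file), where 0xFC is the character U+00FC.
my $text = decode( "Windows-1252", my $copy2 = $bytes, FB_CROAK );

printf "valid UTF-8: %s\n", $utf8_ok ? "yes" : "no";
printf "second character: U+%04X\n", ord( substr( $text, 1, 1 ) );
```

So the file does not contain the text "xFC" or "\374"; it contains a single byte that is not valid UTF-8 at that position, and each editor renders it in its own escape notation.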

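The problem is that the encoding the document claims to use doesn't match the bytes in the file. The XML declaration says UTF-8 (or omits the encoding, in which case UTF-8 is assumed), but the byte 0xFC followed by "ller" is not valid UTF-8.

As the message says, you need to indicate the correct encoding. For example, if the file is encoded using Windows-1252, the XML declaration should say so:

<?xml version="1.0" encoding="Windows-1252"?>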

Another way of making them match, and probably the one that makes the most sense, is to convert the file to use UTF-8.

Inside a Perl script, the following could be used:

use Encode qw( from_to );

# Converts the bytes in $xml in place.
from_to( $xml, "Windows-1252", "UTF-8" );
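If some of the files might already be valid UTF-8, converting them blindly would double-encode them. The following is a rough sketch of a guard, not part of the original answer: repair_to_utf8 is a hypothetical helper name, and it assumes that anything that fails strict UTF-8 decoding is Windows-1252.

```perl
use strict;
use warnings;
use Encode qw( decode encode FB_CROAK );

# Hypothetical helper: takes raw bytes, returns UTF-8 encoded bytes.
# Assumption: input that is not valid UTF-8 is Windows-1252.
sub repair_to_utf8 {
    my ($bytes) = @_;

    # Already valid UTF-8? Then return it unchanged.
    return $bytes
        if eval { decode( "UTF-8", my $copy = $bytes, FB_CROAK ); 1 };

    # Otherwise reinterpret the bytes as Windows-1252 and re-encode.
    return encode( "UTF-8", decode( "Windows-1252", $bytes ) );
}

my $fixed = repair_to_utf8("<name>M\xFCller</name>");
```

When applying this to a file, read and write it with the :raw layer so that Perl hands the bytes through untouched, and make sure the XML declaration does not name a different encoding afterwards.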

From the command line, this could be done using iconv.

iconv -f Windows-1252 -t UTF-8
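Since the question already decodes the file from base64 in Perl, the conversion could also be folded into that step. A minimal sketch, assuming the payload is Windows-1252; the base64 string here is generated sample data standing in for the real file:

```perl
use strict;
use warnings;
use MIME::Base64 qw( decode_base64 encode_base64 );
use Encode qw( from_to );

# Stand-in for the base64 payload from the server (hypothetical sample).
my $base64 = encode_base64("<name>M\xFCller</name>");

# decode_base64 returns raw bytes; it knows nothing about character
# encodings, so the umlaut comes out as the single byte 0xFC.
my $xml = decode_base64($base64);

# Convert those bytes in place. Assumes the payload is Windows-1252.
from_to( $xml, "Windows-1252", "UTF-8" );
```

Base64 itself is encoding-agnostic, so the decoding step cannot fix the umlauts on its own; the conversion has to be a separate call like the one above.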

Why does one editor display octal codes and the other hexadecimal codes?

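First of all, it's not a different number. FC in hexadecimal and 374 in octal are the same byte value (252 in decimal); the two editors just write it in different bases.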
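A quick way to check this is to print one byte value in both notations. The format strings below only imitate how the two editors happen to render the byte:

```perl
use strict;
use warnings;

my $byte = 0xFC;

my $hex = sprintf "x%02X",  $byte;    # Notepad++ style
my $oct = sprintf "\\%03o", $byte;    # Emacs style

print "$hex $oct $byte\n";            # prints: xFC \374 252
```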

And because hex was the preferred representation of bytes when Notepad++ was written, octal having been abandoned long before.
