giordano

Reputation: 3152

Perl: Repair utf8 xml file which contains octal or hexadecimal codes

I received an XML file from a Linux server on a Windows 10 machine. The file was base64-encoded. I decoded it with a Perl script using the function decode_base64 from MIME::Base64. I then checked with another Perl script whether the result is well-formed, but it is not:

C:\test>perl test_well_formed.pl test.xml
test.xml:3: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xFC 0x6C 0x6C 0x65
<print>M³ller</print>
        ^

I looked at the content. Notepad++ displays the umlaut ü as a hexadecimal code:

<?xml version="1.0" encoding="utf-8" ?>
<test>
<print>MxFCller</print>
</test>

and Emacs displays ü as octal code:

<?xml version="1.0" encoding="utf-8" ?>
<test>
<print>M\374ller</print>
</test>

The encoding in Emacs was:

 Its value is ‘utf-8-dos’

Obviously, the hexadecimal and octal-codes are not allowed in utf8 xml.

What I want is:

<?xml version="1.0" encoding="utf-8" ?>
<test>
<print>Müller</print>
</test>

My main question is: How can I repair the xml-files?

One solution could be to read the file with a Perl script, line by line or by slurping it, and replace the hexadecimal code (or octal code?) with the umlaut. Or is there a better way to repair it? For example, could the umlauts be handled when the base64 file is decoded?

A second question is: Why does one editor display octal codes and the other hexadecimal codes?

Upvotes: 2

Views: 176

Answers (1)

ikegami

Reputation: 385754

You don't have "hex codes" or "octal codes". That's how Notepad++ and Emacs display invalid bytes in the file.
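As a minimal sketch of what "invalid bytes" means here, the following checks whether a byte string is well-formed UTF-8 by attempting a strict decode. The helper name is_valid_utf8 and the two sample strings are illustrative assumptions, not part of the original scripts.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, because decode() may modify its input when a CHECK flag is given
    return eval { decode( 'UTF-8', $bytes, FB_CROAK ); 1 } ? 1 : 0;
}

my $single_byte = "M\xFCller";        # contains the bytes from the parser error: 0xFC 0x6C 0x6C 0x65
my $utf8        = "M\xC3\xBCller";    # the same name with ü encoded as UTF-8 (0xC3 0xBC)

print is_valid_utf8($single_byte) ? "valid" : "invalid", "\n";    # prints "invalid"
print is_valid_utf8($utf8)        ? "valid" : "invalid", "\n";    # prints "valid"
```

The byte 0xFC can never appear in well-formed UTF-8, which is why both the XML parser and the editors flag it.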

The problem is that this doesn't match the file:

<?xml version="1.0" encoding="utf-8"?>

As the message says, you need to specify the correct encoding. For example, if the file is encoded using Windows-1252, you should be using

<?xml version="1.0" encoding="Windows-1252"?>

Another way of making them match, and probably the one that makes the most sense, is to convert the file to use UTF-8.

Inside a Perl script, the following could be used:

use Encode qw( from_to );

# Converts the bytes in $xml in place.
from_to( $xml, "Windows-1252", "UTF-8" );
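A self-contained sketch of that conversion, applied to the sample document from the question. It assumes the bytes really are Windows-1252 (from_to cannot detect the source encoding), and the wrapper name cp1252_to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw( from_to );

# Hypothetical wrapper: returns a UTF-8 encoded copy of a Windows-1252 byte string.
sub cp1252_to_utf8 {
    my ($bytes) = @_;                              # copy, so the caller's string is untouched
    from_to( $bytes, 'Windows-1252', 'UTF-8' );    # converts $bytes in place
    return $bytes;
}

# The sample document from the question, with ü stored as the single byte 0xFC.
my $xml = qq{<?xml version="1.0" encoding="utf-8" ?>\n}
        . qq{<test>\n<print>M\xFCller</print>\n</test>\n};

my $fixed = cp1252_to_utf8($xml);

binmode STDOUT;    # emit the bytes unchanged, without newline translation
print $fixed;
```

In the asker's workflow, this step would run on the string returned by decode_base64, before the result is written to disk.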

From the command line, this could be done using iconv.

iconv -f Windows-1252 -t UTF-8

Why does one editor display octal codes and the other hexadecimal codes?

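First of all, it's the same number in both editors: hexadecimal FC and octal 374 are both 252, the byte 0xFC from the error message.

As for the different notation, hex was the preferred representation of bytes when Notepad++ was written, octal having been abandoned long before.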
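A quick way to confirm that the two displays name the same byte, using Perl's built-in hex and oct functions (the variable names are illustrative):

```perl
use strict;
use warnings;

my $from_hex   = hex('FC');     # Notepad++ shows xFC
my $from_octal = oct('374');    # Emacs shows \374

printf "hex FC = %d, octal 374 = %d\n", $from_hex, $from_octal;    # both are 252
printf "same byte: %s\n", $from_hex == $from_octal ? 'yes' : 'no';
```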

Upvotes: 3
