Reputation: 10205
I am dealing with a legacy file that has been encoded twice using UTF-8. For example, the code point ε (U+03B5) should have been encoded as CE B5 but has instead been encoded as C3 8E C2 B5 (C3 8E is the UTF-8 encoding of U+00CE, C2 B5 is the UTF-8 encoding of U+00B5).
The second encoding was performed on the assumption that the data was encoded in CP-1252.
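To make the byte sequences concrete, here is a minimal Ruby sketch that reproduces the double encoding for ε. The helper names `hex` and `double_encode` are made up for illustration, and the sketch assumes Ruby's CP1252 table agrees with whatever tool produced the file, at least for these two bytes.

```ruby
# Format a string's bytes as space-separated uppercase hex.
def hex(s)
  s.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Misread UTF-8 bytes as CP1252, then encode the result as UTF-8 again.
# (Hypothetical helper: this is one plausible way the file got damaged.)
def double_encode(s)
  s.dup.force_encoding(Encoding::CP1252).encode(Encoding::UTF_8)
end

original = "\u03B5"               # ε
puts hex(original)                # CE B5
puts hex(double_encode(original)) # C3 8E C2 B5
```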
To go back to the UTF-8 encoding I use the following (seemingly wrong) command:
iconv --from utf8 --to cp1252 <file.double-utf8 >file.utf8
My problem is that iconv seems unable to convert some characters back. More precisely, iconv is unable to convert characters whose UTF-8 representation contains a byte that maps to a control character in CP-1252. One example is the code point ρ (U+03C1): its UTF-8 encoding is CF 81; CF is re-encoded to C3 8F and 81 is re-encoded to C2 81. iconv refuses to convert C2 81 back to 81, probably because it does not know how to map that control character precisely.
echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to cp1252
�iconv: illegal input sequence at position 2
How can I tell iconv to just perform the mathematical UTF-8 conversion without caring about the mappings?
Upvotes: 2
Views: 3665
Reputation: 10205
The following code uses Ruby's low-level encoding functions to force the rewriting of double-encoded UTF-8 (via CP1252) back into normal UTF-8.
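#!/usr/bin/env ruby
# Undo a UTF-8 -> CP1252 -> UTF-8 double encoding: reads stdin, writes stdout.
ec = Encoding::Converter.new(Encoding::UTF_8, Encoding::CP1252)

prev_b = nil
orig_bytes = STDIN.read.force_encoding(Encoding::BINARY).bytes
real_utf8_bytes = String.new  # empty binary (ASCII-8BIT) output buffer

orig_bytes.each do |byte|
  b = byte.chr
  # Feed one byte at a time; PARTIAL_INPUT says more input may follow.
  situation = ec.primitive_convert(b.dup, real_utf8_bytes, nil, nil, Encoding::Converter::PARTIAL_INPUT)
  if situation == :undefined_conversion
    # Only C2 xx pairs (U+0080..U+00BF) may be passed through unchanged.
    # Compare as binary strings, because prev_b comes from Integer#chr.
    if prev_b != "\xC2".b
      $stderr.puts "ERROR found byte #{b.dump} in stream (prev #{(prev_b || '').dump})"
      exit 1
    end
    # CP1252 has no mapping for this character: emit the raw byte instead.
    real_utf8_bytes.force_encoding(Encoding::BINARY)
    real_utf8_bytes << b
    real_utf8_bytes.force_encoding(Encoding::CP1252)
  end
  prev_b = b
end

real_utf8_bytes.force_encoding(Encoding::BINARY)
# write, not puts, so that no trailing newline is appended to the data
$stdout.write(real_utf8_bytes)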
It is meant to be used in a pipeline:
cat "$PROBLEMATIC_FILE" | ./fix-double-utf8-encoding.rb > "$CORRECTED_FILE"
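The same idea can also be sketched per character instead of per byte, which may be easier to follow. This is only a sketch under a few assumptions: the input is valid UTF-8, the helper name `undo_double_utf8` is made up, and any character that CP1252 cannot represent is expected to be a C1 control (U+0080..U+009F) that should become the byte of the same value.

```ruby
# Undo a UTF-8 -> CP1252 -> UTF-8 double encoding, one character at a time.
def undo_double_utf8(doubled)
  bytes = doubled.each_char.flat_map do |ch|
    begin
      # Normal case: the character exists in CP1252, take its single byte.
      ch.encode(Encoding::CP1252).bytes
    rescue Encoding::UndefinedConversionError
      cp = ch.ord
      # Anything outside the C1 range is a real error: re-raise it.
      raise unless (0x80..0x9F).cover?(cp)
      [cp] # pass the control character through as a raw byte
    end
  end
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

puts undo_double_utf8("\u00CF\u0081") == "\u03C1"       # ρ, prints true
puts undo_double_utf8("\u00E2\u201A\u00AC") == "\u20AC" # €, prints true
```

Unlike the script above, this works on a whole string in memory, so it suits small files better than large streams.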
Upvotes: 0
Reputation: 9402
echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to iso8859-1
Windows-1252 differs from ISO-8859-1 in the 0x80-0x9F range. For example, in your case, 0x81 is U+0081 in ISO 8859-1, but is invalid in Windows-1252.
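For comparison with the iconv command, here is a small Ruby sketch of the same round trip. The helper name `undo_via_latin1` is hypothetical. It relies on ISO-8859-1 mapping every code point U+0000..U+00FF to the byte of the same value, so U+0081 simply becomes 0x81.

```ruby
# Reverse the second encoding step, assuming it treated the bytes as ISO-8859-1.
def undo_via_latin1(doubled)
  doubled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

doubled = "\u00CF\u0081" # bytes C3 8F C2 81, the ρ example from the question
fixed = undo_via_latin1(doubled)
puts fixed.bytes.map { |b| format("%02X", b) }.join(" ") # CF 81
puts fixed == "\u03C1"                                    # true

# The same string cannot be converted to CP1252, which has no byte for U+0081.
begin
  doubled.encode(Encoding::CP1252)
rescue Encoding::UndefinedConversionError => e
  puts "CP1252 conversion failed: #{e.message}"
end
```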
Check whether the rest of your data was misinterpreted as Windows-1252 or ISO 8859-1. Usually, ISO 8859-1 is more common.
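One rough way to run that check, sketched in Ruby with a made-up helper name `cp1252_hint?`: every character that CP1252 assigns to a byte in 0x80..0x9F (€, ‚, ƒ, „ and so on) has a code point above U+00FF, whereas data misread as ISO-8859-1 only ever produces code points up to U+00FF. This is a heuristic only. It assumes the whole input is double-encoded, and it says nothing if the file happens to contain none of the affected bytes.

```ruby
# Returns true if the double-encoded text contains a character that can only
# have come from a CP1252 misreading (code point above U+00FF).
def cp1252_hint?(doubled)
  doubled.each_char.any? { |ch| ch.ord > 0xFF }
end

puts cp1252_hint?("\u00E2\u201A\u00AC") # double-encoded € via CP1252: true
puts cp1252_hint?("\u00CF\u0081")       # double-encoded ρ, no evidence: false
```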
Upvotes: 2