sam

Reputation: 233

How to convert an ASCII-encoded file to UTF-8 in Perl?

I want to convert an ASCII-encoded text file to UTF-8. So far I have tried this:

open( my $test, ">:encoding(utf-8)", $test_file ) or die("Error: Could not open file!\n");

Then I ran the command below, which reports the file's encoding:

file $test_file
test_file: ASCII text

Please let me know if I am missing something here.

Upvotes: 0

Views: 264

Answers (2)

ikegami

Reputation: 386551

You are doing it correctly.

ASCII is a subset of UTF-8.

          decode          encode
ASCII       ⇒   Unicode     ⇒   UTF-8
----------      ----------      ----------
00              U+0000          00
01              U+0001          01
02              U+0002          02
⋮               ⋮               ⋮
7E              U+007E          7E
7F              U+007F          7F
----------      ----------      ----------
ASCII       ⇐   Unicode     ⇐   UTF-8
          encode          decode

As such, an ASCII file is a UTF-8 file.[1]
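The table above can be checked directly. The following is a minimal sketch, assuming the core Encode module is available; the variable names are illustrative only. It encodes every ASCII code point both as ASCII and as UTF-8 and compares the resulting bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a string containing every ASCII code point, U+0000 through U+007F.
my $ascii_chars = join '', map { chr($_) } 0 .. 0x7F;

# Encode the same characters with both encodings.
my $as_ascii = encode('ascii', $ascii_chars);
my $as_utf8  = encode('UTF-8', $ascii_chars);

# Compare the two byte strings.
my $same_bytes = $as_ascii eq $as_utf8 ? 1 : 0;
print "identical bytes: ", ($same_bytes ? "yes" : "no"), "\n";
```

If the two byte strings compare equal, writing the text through an :encoding(UTF-8) layer cannot change the file's contents.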

When you only use that subset, file identifies the file as being encoded using ASCII.

$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdef"'  | file -
/dev/stdin: ASCII text

Going out of that subset causes file to identify the file as text encoded using UTF-8.

$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdéf"' | file -
/dev/stdin: UTF-8 Unicode text
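The distinction file draws here can be approximated in Perl. This is only a rough sketch of the idea, not how file is actually implemented; classify_bytes is a hypothetical helper, and the real tool applies many more checks.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: classify a byte string roughly the way the examples suggest.
sub classify_bytes {
    my ($bytes) = @_;

    # Only bytes 00..7F present: indistinguishable from plain ASCII.
    return 'ASCII' if $bytes !~ /[^\x00-\x7F]/;

    # Otherwise, see whether the bytes decode as strict UTF-8.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'UTF-8' : 'other';
}

my $plain    = classify_bytes("abcdef\n");          # ASCII-only bytes
my $accented = classify_bytes("abcd\xC3\xA9f\n");   # "é" encoded as UTF-8
my $latin1   = classify_bytes("abcd\xE9f\n");       # "é" encoded as Latin-1
print "$plain $accented $latin1\n";
```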

  1. It is also an iso-latin-1 file, an iso-latin-2 file, an iso-latin-3 file, a cp1250 file, a cp1251 file, a cp1252 file, and so on.

Upvotes: 3

Dave Cross

Reputation: 69314

Any file that is in ASCII (i.e. containing only codepoints from 0 to 127) is already in UTF-8. There will be no difference in encoding and, hence, no way for file to identify it as UTF-8.

Differences in encoding appear only for characters with code points of 128 and above.

It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

(From the Wikipedia article on UTF-8)
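To see where the encodings start to differ, one can print the UTF-8 bytes on either side of the boundary. A small sketch, assuming the core Encode module; utf8_hex is a hypothetical helper introduced only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 bytes for a single code point.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $below   = utf8_hex(0x7F);   # last ASCII code point
my $above   = utf8_hex(0x80);   # first code point outside ASCII
my $e_acute = utf8_hex(0xE9);   # U+00E9, "é"

print "U+007F -> $below\n";     # one byte, same value as in ASCII
print "U+0080 -> $above\n";     # two bytes
print "U+00E9 -> $e_acute\n";   # two bytes
```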

Upvotes: 5
