Erlang: read a text file with non-English characters

I am trying to use file:consult/1 to read a file of Erlang terms. However, the file contains some non-English characters in strings. So, when I read the file, those strings are displayed as a list of numbers.

Does anyone know how I can read such a file and print out the foreign characters?

I have tried the following in the shell:

ets:new(myTable, [bag,named_table]).
ets:insert(myTable, {"some_funny_chars"}).

The result is that the string gets saved as a list of integers, so when I call something like ets:lookup/2 the shell also gives me back a list of integers. I want to see "some_funny_chars" instead!

Hope it makes sense.

Upvotes: 1

Answers (2)

zxq9

Reputation: 13154

The basic principle to keep in mind is that you are seeing Unicode already, all the time. An Erlang string is a list of integers (Unicode code points), and without any special instruction the shell will just show you that: a list of numbers.

You can use io:format/2 to show Unicode the way you expect (if your terminal can print the characters, that is) by changing from

io:format("Print a term: ~p~n", [Term])

to

io:format("Print a Unicode term: ~tp~n", [UnicodeTerm])
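As a rough illustration of the difference, the sketch below renders the same list of code points with ~p and with ~ts (the string counterpart of ~tp). The module name fmt_demo and the function render/1 are made up for this example. One caveat worth knowing: ~tp only shows code points above 255 as a string when the emulator's printable range is Unicode (for example, when started with erl +pc unicode), whereas ~ts always emits the characters themselves.

```erlang
-module(fmt_demo).
-export([render/1]).

%% Hypothetical helper: format one string two ways and return both results.
%% io_lib:format/2 is used instead of io:format/2 so the output can be inspected.
render(Str) ->
    Plain   = lists:flatten(io_lib:format("~p", [Str])),   % term syntax
    Unicode = lists:flatten(io_lib:format("~ts", [Str])),  % the characters themselves
    {Plain, Unicode}.
```

Calling fmt_demo:render([1087,1088]) returns the text "[1087,1088]" for the ~p case and the two original code points for the ~ts case, which is what your terminal would draw as characters.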

There are some basic encoding things that can be useful when dealing with Unicode files as data (I'm not sure about file:consult/1 getting Erlang terms, though). Here is a stub module you can build on for doing read_file and write_file:

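%% Beginnings of a utf8 file I/O module
%% -*- coding: utf-8 -*-

-module(u_file).
-export([write_file/2, read_file/1]).

%% Chars is Unicode character data (e.g. a list of code points);
%% it is written to disk as a UTF-8 encoded binary.
write_file(Filename, Chars) ->
    file:write_file(Filename, unicode:characters_to_binary(Chars, utf8)).

%% Reads the whole file and decodes it from UTF-8 into a list of code points.
%% Note: if the file is not valid UTF-8, characters_to_list/2 returns an
%% error or incomplete tuple, which ends up wrapped in {ok, ...} here.
read_file(Filename) ->
    case file:read_file(Filename) of
        {ok, Data} -> {ok, unicode:characters_to_list(Data, utf8)};
        Other -> Other
    end.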

I'm not sure what you need to see from your ETS tables, but if it is just checking values in the shell then you simply need to switch from the ~p term substitution to the ~tp unicode term substitution. Actually, using ~tp everywhere is not a bad idea, as it works exactly the same way ~p does on other data (ASCII being a subset of UTF-8 is convenient!).
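For the ETS case specifically, something along these lines may help. It is only a sketch: the module name ets_demo, the table name, and the sample word are assumptions for illustration. It stores a string of non-ASCII code points, looks it up again, and prints it with ~ts so the characters are shown rather than the integers.

```erlang
-module(ets_demo).
-export([run/0]).

%% Hypothetical round trip: insert a Unicode string into ETS,
%% read it back, and print it as characters.
run() ->
    Tab = ets:new(my_table, [bag]),
    Str = [1055,1088,1080,1074,1077,1090],   % Cyrillic "Privet" as code points
    true = ets:insert(Tab, {Str}),
    [{Stored}] = ets:lookup(Tab, Str),       % comes back as the same list of integers
    io:format("~ts~n", [Stored]),            % shown as text if the terminal supports it
    true = ets:delete(Tab),
    Stored =:= Str.
```

The data in the table is identical either way; only the format directive used for display changes.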

Hopefully this gets you closer to a solution. Whatever the case, I strongly recommend that every Erlanger read the "Using Unicode in Erlang" part of the docs.

Upvotes: 2

Nathaniel Waisbrot

Reputation: 24483

In Erlang, all strings are lists of numbers. The REPL tries to be helpful by displaying an ASCII string when it thinks that's what it has and a list of numbers when it doesn't, but this is just a display feature.
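To make that concrete, here is a small sketch (the module name str_demo is made up). It shows that a string literal and a list of integers are the same term, and uses two checks from io_lib that are similar in spirit to the heuristic the shell applies when deciding how to display a list.

```erlang
-module(str_demo).
-export([same/0, looks_like_text/1]).

%% A string literal is just a list of integers.
same() ->
    "abc" =:= [97,98,99].

%% Reports whether a list counts as printable under the Latin-1 range
%% and under the full Unicode range. Which range the shell actually
%% uses depends on how the emulator was started.
looks_like_text(List) ->
    {io_lib:printable_latin1_list(List),
     io_lib:printable_unicode_list(List)}.
```

For a Cyrillic string such as [1087,1088], the Latin-1 check fails while the Unicode check succeeds, which is why a default shell falls back to showing the numbers.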

If you're writing the strings back to a file or comparing them in memory, you should be OK to treat all your strings the same. The foreign chars will be ugly to look at when debugging, but they should read and write correctly. I'm not sure if things are as easy if you need to store the strings in an external database or send them over the wire to some other service. At that point, you'll probably need to handle encoding yourself.

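For a better time, though, note that UTF-8 is the default source file encoding in Erlang/OTP 17.0 and later. This means that if your file is UTF-8 encoded and you're using Erlang 17 or newer, file:consult/1 should read the strings correctly; what remains is printing them with a Unicode-aware format directive.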
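A quick way to check this on your own installation is a round trip like the sketch below. The module name consult_demo and the file name are hypothetical. It writes a small UTF-8 terms file and reads it back with file:consult/1. The coding comment on the first line should be optional on OTP 17 and later, but including it makes the intent explicit and also covers older releases that honor it.

```erlang
-module(consult_demo).
-export([run/0]).

%% Hypothetical round trip: write a UTF-8 terms file, then consult it.
run() ->
    Path = "consult_demo.terms",               % made-up file name
    Word = [1055,1088,1080,1074,1077,1090],    % Cyrillic "Privet" as code points
    Text = ["%% -*- coding: utf-8 -*-\n",
            "{greeting, \"", Word, "\"}.\n"],
    %% Encode the character data as UTF-8 before writing it out.
    ok = file:write_file(Path, unicode:characters_to_binary(Text)),
    {ok, [{greeting, Read}]} = file:consult(Path),
    ok = file:delete(Path),
    Read =:= Word.                             % true if the code points survived
```

If run/0 returns true, the strings are being decoded properly and any remaining oddness is purely a display matter.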

Upvotes: 1
