marlon

Reputation: 7693

Does the PyCharm remote interpreter introduce an encoding issue?

The code below raises an error:

    def get_name(self, name_dir):
        name = {}
        for cnt, line in enumerate(open(name_dir, 'r')):
            id, gname = line.strip().split('\t')
            name[id] = gname
        return name

The error message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 27: ordinal not in range(128)

I run it with Python 3 both locally on my Mac and on Ubuntu machines, and there are no errors. However, when I set up a remote interpreter in PyCharm that uses the same Python 3 interpreter on the Ubuntu machine, it reports the error.

I fixed the problem by adding encoding='utf8' to the open() call. Does that mean PyCharm introduced the error while configuring the remote interpreter?

Upvotes: 0

Views: 122

Answers (1)

Giacomo Catenazzi

Reputation: 9533

No, PyCharm does what it is meant to do.

You tested locally on both Mac and Ubuntu, which is good, but that hides the fact that your local machines have locales that probably support UTF-8, while the remote one apparently does not.

By default, most Linux distributions set users up with a UTF-8 locale, so I assume one of the following two cases applies (the second is far more probable from my point of view):

  • You installed Ubuntu from a very small image, with few (or no) locales. Or the locale you want to use is not installed, so Ubuntu falls back to the default locale (the C locale).

  • Or you are using root. Root usually (and for portability) uses just the C locale, which supports only the ASCII charset.

So I assume that on your remote machine you are not using a UTF-8 locale. Try running export LANG=en_US.utf8 (a very common locale) to see if this solves the problem, or use the same string that echo $LANG returns on your local Ubuntu (or Mac). Note: locale -a lists all installed locales (you may need to install or activate additional ones).
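
To compare the two environments, a small diagnostic can be run once from an SSH shell and once through the PyCharm remote interpreter. This is only a sketch: the function name describe_text_environment is made up for illustration, and the set of variables it checks (LANG, LC_ALL, LC_CTYPE) is an assumption about what is relevant on your machine.

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect the settings that influence how open() decodes text by default."""
    # Locale-related environment variables; None means the variable is unset.
    report = {name: os.environ.get(name) for name in ("LANG", "LC_ALL", "LC_CTYPE")}
    # The encoding open() falls back to when no encoding= argument is given.
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    # The encoding used when printing to the console (may be None if redirected).
    report["stdout_encoding"] = getattr(sys.stdout, "encoding", None)
    return report


report = describe_text_environment()
for key, value in report.items():
    print(f"{key}: {value!r}")
```

If the two runs print different values for preferred_encoding, the difference comes from the environment each process was started with, not from the interpreter binary.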

Note: if the reason is that you are using root, you have a difficult decision: set the default root locale either to en_US.utf8 or to C.utf8 (in non-interactive shells, keep C or en_US as the language, or many scripts will fail).

My personal take: root programs should write only ASCII characters. Files and logs are the place for UTF-8 data. To write to the console, use a fallback error handler such as replace or backslashreplace (for security reasons, I would never use ignore as root).
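
As a rough illustration of those error handlers, the sketch below encodes a string to ASCII with each of them. The sample text is invented; only the handler names come from Python itself.

```python
# Hypothetical sample text containing two non-ASCII characters.
text = "name: Hélène"

# backslashreplace keeps the information as visible escape sequences.
escaped = text.encode("ascii", errors="backslashreplace").decode("ascii")

# replace substitutes a question mark for every character ASCII cannot represent.
replaced = text.encode("ascii", errors="replace").decode("ascii")

# ignore silently drops those characters, which can hide what was really there.
dropped = text.encode("ascii", errors="ignore").decode("ascii")

print(escaped)   # name: H\xe9l\xe8ne
print(replaced)  # name: H?l?ne
print(dropped)   # name: Hlne
```

On Python 3.7 and later, the same handler can be applied to the console stream itself with sys.stdout.reconfigure(errors="backslashreplace").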

In any case, your code is also not ideal, as you found out when adding encoding='utf8' solved the problem. The Python mantra includes "explicit is better than implicit", so if you know the encoding of name_dir, just specify it instead of letting Python (not PyCharm) work out the locale's encoding (which in your case seems to be ASCII). Python uses locale.getpreferredencoding to find the default encoding.

So it is not PyCharm; the problem is your remote locale (which Python uses by default). I would solve the problem by explicitly passing the encoding to open (for both reading and writing). But if you are not running the program as root, I would also check why your remote user does not get a UTF-8 locale by default.
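
A minimal sketch of the explicit version, assuming the file really is UTF-8 and tab-separated as in the question. The function name read_names, the temporary sample file, and its contents are invented for the demonstration.

```python
import os
import tempfile


def read_names(path, encoding="utf-8"):
    """Map id -> name from a tab-separated file, decoding with an explicit encoding."""
    names = {}
    # The with block closes the file; encoding= removes the locale dependency.
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            key, gname = line.rstrip("\n").split("\t")
            names[key] = gname
    return names


# Hypothetical sample file containing a non-ASCII name.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".tsv", delete=False
) as tmp:
    tmp.write("1\tHélène\n2\tMarlon\n")
    sample_path = tmp.name

names = read_names(sample_path)

# Forcing ASCII reproduces the kind of failure seen under the C locale.
ascii_failed = False
try:
    read_names(sample_path, encoding="ascii")
except UnicodeDecodeError:
    ascii_failed = True

os.remove(sample_path)
print(names, ascii_failed)
```

With the encoding spelled out, the result no longer depends on which locale the process inherits.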

Upvotes: 1
