Reputation: 1272
I have a Python file named 'কাজ.py'. The file does nothing fancy, and its contents are not the issue. The problem is that when I try to run the file by copying and pasting the file name into the console, it does not show 'কাজ.py'; it shows some boxes instead:
> python [?][?][?].py
and it raises an error like this
python: can't open file '???.py': [Errno 22] Invalid argument
But on the same console, if I write git add কাজ.py, it shows
> git add [?][?][?].py
but surprisingly it works and does not give any error.
My question is: how can git take Unicode input on the same console where Python cannot? Please note that I am on Windows and using cmd.exe.
Upvotes: 0
Views: 67
Reputation: 148965
It depends on whether the command internally uses the Unicode or the MBCS command-line interface. Assuming it is a C (or C++) program, it depends on whether it uses main or wmain. If it uses the Unicode interface, it gets the true Unicode characters (even if the console cannot display them and shows only ?) and so opens the correct file. But if it uses the so-called MBCS interface, characters that cannot be represented in the active ANSI code page are translated to a real ? (character code 0x3f), and it tries to open the wrong file.
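The loss described above can be imitated in a few lines of Python 3. This is only a simulation of the effect, not the actual C runtime code: the function name is made up for illustration, and cp1252 is an assumed ANSI code page (a typical Western European setup). The real Windows conversion may also apply "best fit" substitutions for some characters rather than always producing ?.

```python
# Simulation only: imitates the lossy wide-to-narrow conversion that a
# program using the narrow (MBCS/ANSI) interface ends up with.
ASSUMED_ANSI_CODE_PAGE = "cp1252"  # assumption: Western European Windows setup


def narrow_like_ansi_interface(arg, code_page=ASSUMED_ANSI_CODE_PAGE):
    """Round-trip a string through a single-byte code page.

    Characters the code page cannot represent come back as a literal '?'.
    (Hypothetical helper, named for this example only.)
    """
    return arg.encode(code_page, errors="replace").decode(code_page)


original = "কাজ.py"
seen_by_narrow_program = narrow_like_ansi_interface(original)

# The name a narrow-interface program would try to open.
print(seen_by_narrow_program)

# Code points before and after the conversion.
print([hex(ord(c)) for c in original[:3]])
print([hex(ord(c)) for c in seen_by_narrow_program[:3]])
```

After the round trip the three Bengali characters are indistinguishable from question marks typed by hand, which is why the program ends up asking Windows for a file literally named ???.py.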
The difference in behaviour simply shows that your git implementation is Unicode-aware while your Python version (I assume 2.x) is not. Untested, but I think that Python 3 is natively Unicode-aware on Windows.
Here is a small C program that demonstrates what happens:
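#include <stdio.h>
#include <windows.h>
#include <tchar.h>

/* _tmain expands to wmain in a UNICODE build and to main in an MBCS build. */
int _tmain(int argc, LPTSTR argv[]) {
    int i;

    /* Print every argument as the program received it. */
    _tprintf(_T("Arguments"));
    for (i = 0; i < argc; i++) {
        _tprintf(_T(" >%s<"), argv[i]);
    }
    _tprintf(_T("\n"));

    if (argc > 1) {
        /* Dump each character of the first parameter with its hex code. */
        LPCTSTR ix = argv[1];
        _tprintf(_T("Dump param 1 :"));
        while (*ix != 0) {
            _tprintf(_T(" %c(%x)"), *ix, ((unsigned int) *ix) & 0xffff);
            ix += 1;
        }
        _tprintf(_T("\n"));
    }
    return 0;
}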
If you call it as cmdline abকাজcd (by pasting the কাজ characters into the console), you see:
...>cmdline ab???cd
Arguments >cmdline< >ab???cd<
Dump param 1 : a(61) b(62) ?(3f) ?(3f) ?(3f) c(63) d(64)
when built in MBCS mode and
...>cmdline ab???cd
Arguments >cmdline< >ab???cd<
Dump param 1 : a(61) b(62) ?(995) ?(9be) ?(99c) c(63) d(64)
when built in UNICODE mode (the 3 characters কাজ are respectively U+0995, U+09BE and U+099C in Unicode).
As the information is lost in the C runtime code that processes the command-line arguments, nothing can be done to recover it. So your only option is to switch to Python 3 if you want to use Unicode names for your scripts.
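To check what a given interpreter actually receives, a small Python 3 counterpart of the C program above can dump the code points of its arguments. This is a sketch: the script name dumpargs.py and the function name are placeholders, and it prints only the hexadecimal code points so that the output does not depend on what the console can display.

```python
import sys


def dump_code_points(arg):
    """Return the code points of arg as 'U+XXXX' values separated by spaces.

    (Hypothetical helper, named for this example only.)
    """
    return " ".join("U+{:04X}".format(ord(ch)) for ch in arg)


if __name__ == "__main__":
    # Dump every command-line argument after the script name.
    for position, arg in enumerate(sys.argv[1:], start=1):
        print("param {}: {}".format(position, dump_code_points(arg)))
```

Run it as python dumpargs.py abকাজcd. If the three characters arrived intact, the dump contains U+0995 U+09BE U+099C; if they were lost on the way, it contains U+003F three times instead.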
Upvotes: 2