Reputation: 5343
When I open cmd.exe on Windows, what encoding is it using?
How can I check which encoding it is currently using?
Does it depend on my regional setting or are there any environment
variables to check?
What happens when you type a file with a certain encoding?
Sometimes I get garbled characters (because of incorrect encoding) and
sometimes it kind of works.
However, I don't trust anything as long as I don't know what's going on.
Can anyone explain?
Upvotes: 306
Views: 302856
Reputation: 5767
Short answer: set cmd.exe to use ANSI encoding by default
DISCLAIMER. Following any suggestion here is at your own risk.
Create and run a .reg file with a suitable name (see footnote 1):
Windows Registry Editor Version 5.00
;; https://stackoverflow.com/a/75788701
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Command Processor]
"Autorun"="C:\\Windows\\System32\\chcp.com 1252"
In case you later change your mind – here is a
CMD-CodePage-1252-Restore.reg
file :
Windows Registry Editor Version 5.00
;; https://stackoverflow.com/a/75788701
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Command Processor]
"Autorun"=-
When I open cmd.exe on Windows, what encoding is it using?
– By default, cmd.exe uses the OEM code page of the system locale, which is 437 on US English systems.
This, in my opinion, is an awful choice.
I suggest instead using your language's ANSI code page.
– It's compatible with the ANSI encoding in Microsoft's own native text editor, C:\WINDOWS\System32\notepad.exe.
For Western European languages, ANSI means code page 1252,
or Windows-1252 (CP-1252).
For other language groups, I've posted a table at the end of
this answer.
How can I check which encoding it is currently using?
– Run C:\WINDOWS\System32\chcp.com:
C:\>chcp
Active code page: 1252
The reason it responds 1252 instead of 437 in my case is that I've deliberately set cmd.exe to use 1252 by default, as described in my "short" answer above.
Does it depend on my regional setting or are there any environment variables to check?
– Neither. What's relevant in this context is the language.
I tried the following:
WinKey + i > Time & Language > Language >
Preferred languages > Add a language.
I added Swedish (Sweden), and then made sure that under
Windows display language, Swedish was chosen.
Finally, I restarted my computer, opened cmd.exe, typed chcp and pressed Enter.
The response was Active code page: 437.
Thus, although the Windows display language changes the language of Windows, it doesn't seem to affect the code page that cmd.exe uses (see footnote 2).
What happens when you type a file with a certain encoding?
Sometimes I get garbled characters (because of incorrect encoding) and sometimes it kind of works.
– Yes. That's exactly what you should expect.
As an example, I have a file, Some-ANSI-chars.txt, which contains the Swedish letters å and ä, encoded with code page 1252 (ANSI).
When I type the file in cmd.exe, the Swedish letters are correctly printed:
C:\stackexchange\stackoverflow\Char-encoding>type Some-ANSI-chars.txt
Sakta men säkert vinner basinkomst mark,
och det viktigaste just nu är att hålla ihop.
But when I make a copy of the file, and convert it to UTF-8, for every (non-ASCII) Swedish letter, two garbled characters are printed :
C:\stackexchange\stackoverflow\Char-encoding>type Some-UTF-8-chars.txt
Sakta men sÃ¤kert vinner basinkomst mark,
och det viktigaste just nu Ã¤r att hÃ¥lla ihop.
As you can see, the two UTF-8 encoded characters å and ä use two bytes each.
The type command decodes each of them as two nonsensical one-byte characters, namely Ã¥ and Ã¤.
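The effect can be reproduced outside the console. Here is a minimal Java sketch (the class and method names are made up for illustration) that encodes text as UTF-8 and then decodes those bytes with a single-byte code page, which is in effect what happens when a UTF-8 file is displayed under code page 1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then decode the bytes as if they were in another code page.
    static String misread(String text, String wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // Escapes keep this source file pure ASCII: a-ring and a-umlaut.
        String garbled = misread("\u00e5\u00e4", "windows-1252");
        // Print code points so the result does not depend on the console's own code page.
        garbled.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

Each two-byte UTF-8 sequence comes back as two characters: U+00C3 (Ã) followed by U+00A5 (¥) or U+00A4 (¤).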
To me, this is not a problem, as I hardly ever type the contents of my text files.
The only thing that matters is that my favorite text editor is set to decode my files as UTF-8 (see footnote 3).
However, I don't trust anything as long as I don't know what's going on.
– That's wise of you.
It protects you (I hope) from falling into the trap of using UTF-8 encoding in cmd.exe, the culprit being code page 65001.
Others have warned about UTF-8 in cmd.exe:
chcp 65001 provides some UTF-8 decoding but it's very rudimentary and doesn't provide proper input (2016).
chcp 65001 is very dangerous (2017).
Avoid chcp.com 65001, except temporarily in batch scripts (2019).
In case you desperately want a command-line tool that correctly outputs the text of your UTF-8 encoded files, I suggest that you download and install the Linux-style MSYS2, which by default assumes that your text files are UTF-8 encoded.
Note that, while your UTF-8 characters are all correctly rendered :
$ cat Some-UTF-8-chars.txt
Sakta men säkert vinner basinkomst mark,
och det viktigaste just nu är att hålla ihop.
the (non-ASCII) ANSI characters will instead be output as question marks :
$ cat Some-ANSI-chars.txt
Sakta men s�kert vinner basinkomst mark,
och det viktigaste just nu �r att h�lla ihop.
In conclusion, cmd.exe correctly outputs ANSI encoded files (see footnote 4), while the MSYS2 terminal correctly outputs UTF-8 encoded files.
"Autorun"="chcp 1252>>nul"
in the registry
chcp 65001 provides some UTF-8 decoding but it's very rudimentary
chcp 65001 is very dangerous
chcp.com 65001
1
The .reg file is inspired by this answer.
I trust that you know how to achieve the same thing manually in the registry.
It's wise to first check the registry to see if you already have a REG_SZ value by the name Autorun.
The registry hack doesn't affect PowerShell. Open PowerShell and run chcp. Expect to see Active code page: 437.
Of course, I use code page 1252 in my .bat files as well.
About 99% of them are pure ASCII files anyway.
2
When doing this experiment, I made sure there was no Autorun value under HKLM\SOFTWARE\Microsoft\Command Processor in the registry.
3
To be precise, I have no less than three "favorite" text editors,
Notepad2, Notepad++,
and Visual Studio Code.
Of these three, Visual Studio Code is set to encode all files as UTF-8, while Notepad2 and Notepad++ auto-detect whatever encoding the editor thinks is correct.
4 Provided you've adopted the registry hack in my "short answer".
Upvotes: 6
Reputation: 577
You can control the code page simply by creating a file %HOMEPATH%\init.cmd.
Mine says:
@ECHO OFF
CHCP 65001 > nul
Upvotes: 3
Reputation: 34803
Yes, it’s frustrating: sometimes type and other programs print gibberish, and sometimes they do not.
First of all, Unicode characters will only display if the current console font contains the characters. So use a TrueType font like Lucida Console instead of the default Raster Font.
But if the console font doesn’t contain the character you’re trying to display, you’ll see question marks instead of gibberish. When you get gibberish, there’s more going on than just font settings.
When programs use standard C-library I/O functions like printf, the program’s output encoding must match the console’s output encoding, or you will get gibberish. chcp shows and sets the current codepage. All output using standard C-library I/O functions is treated as if it is in the codepage displayed by chcp.
Matching the program’s output encoding with the console’s output encoding can be accomplished in two different ways:
A program can get the console’s current codepage using chcp or GetConsoleOutputCP, and configure itself to output in that encoding, or
You or a program can set the console’s current codepage using chcp or SetConsoleOutputCP to match the default output encoding of the program.
However, programs that use Win32 APIs can write UTF-16LE strings directly to the console with WriteConsoleW.
This is the only way to get correct output without setting codepages. And even when using that function, if a string is not in the UTF-16LE encoding to begin with, a Win32 program must pass the correct codepage to MultiByteToWideChar.
Also, WriteConsoleW will not work if the program’s output is redirected; more fiddling is needed in that case.
type works some of the time because it checks the start of each file for a UTF-16LE Byte Order Mark (BOM), i.e. the bytes 0xFF 0xFE.
If it finds such a mark, it displays the Unicode characters in the file using WriteConsoleW regardless of the current codepage. But when typing any file without a UTF-16LE BOM, or when using non-ASCII characters with any command that doesn’t call WriteConsoleW, you will need to set the console codepage and program output encoding to match each other.
How can we find this out?
Here’s a test file containing Unicode characters:
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
Here’s a Java program to print out the test file in a bunch of different Unicode encodings. It could be in any programming language; it only prints ASCII characters or encoded bytes to stdout.
import java.io.*;

public class Foo {

    private static final String BOM = "\ufeff";
    private static final String TEST_STRING
        = "ASCII abcde xyz\n"
        + "German äöü ÄÖÜ ß\n"
        + "Polish ąęźżńł\n"
        + "Russian абвгдеж эюя\n"
        + "CJK 你好\n";

    public static void main(String[] args)
        throws Exception
    {
        String[] encodings = new String[] {
            "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };

        for (String encoding: encodings) {
            System.out.println("== " + encoding);

            for (boolean writeBom: new Boolean[] {false, true}) {
                System.out.println(writeBom ? "= bom" : "= no bom");

                String output = (writeBom ? BOM : "") + TEST_STRING;
                byte[] bytes = output.getBytes(encoding);
                // Write the raw encoded bytes to stdout, with no charset conversion.
                System.out.write(bytes);

                // Save the same bytes to a file so they can be inspected with type.
                FileOutputStream out = new FileOutputStream("uc-test-"
                    + encoding + (writeBom ? "-bom.txt" : "-nobom.txt"));
                out.write(bytes);
                out.close();
            }
        }
    }
}
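It may help to see which bytes the BOM itself turns into in each of those encodings. The following is a small sketch, separate from the program above (the class and method names are invented for illustration), that hex-dumps U+FEFF as encoded by each charset.

```java
import java.nio.charset.Charset;

public class BomBytes {
    // Hex dump of U+FEFF (the BOM character) encoded in the given charset.
    static String bomHex(String encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : "\ufeff".getBytes(Charset.forName(encoding))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String enc : new String[] {
                "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" }) {
            System.out.println(enc + ": " + bomHex(enc));
        }
    }
}
```

The UTF-8 BOM is EF BB BF, the UTF-16LE BOM is FF FE, and the UTF-32LE BOM is FF FE 00 00, so the UTF-16LE and UTF-32LE files start with the same two bytes.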
The output in the default codepage? Total garbage!
Z:\andrew\projects\sx\1259084>chcp
Active code page: 850
Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
= bom
´╗┐ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
== UTF-16LE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
= bom
■A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
== UTF-16BE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
= bom
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
== UTF-32LE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
= bom
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
== UTF-32BE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
= bom
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
However, what if we type the files that got saved? They contain the exact same bytes that were printed to the console.
Z:\andrew\projects\sx\1259084>type *.txt
uc-test-UTF-16BE-bom.txt
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
uc-test-UTF-16BE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
uc-test-UTF-16LE-bom.txt
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
uc-test-UTF-16LE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
uc-test-UTF-32BE-bom.txt
■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
uc-test-UTF-32BE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
uc-test-UTF-32LE-bom.txt
A S C I I a b c d e x y z
G e r m a n ä ö ü Ä Ö Ü ß
P o l i s h ą ę ź ż ń ł
R u s s i a n а б в г д е ж э ю я
C J K 你 好
uc-test-UTF-32LE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
uc-test-UTF-8-bom.txt
´╗┐ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
uc-test-UTF-8-nobom.txt
ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
The only thing that works is a UTF-16LE file, with a BOM, printed to the console via type.
If we use anything other than type to print the file, we get garbage:
Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON
■A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
1 file(s) copied.
From the fact that copy CON does not display Unicode correctly, we can conclude that the type command has logic to detect a UTF-16LE BOM at the start of the file, and use special Windows APIs to print it.
We can see this by opening cmd.exe in a debugger when it goes to type out a file:
After type opens a file, it checks for a BOM of 0xFEFF (i.e., the bytes 0xFF 0xFE in little-endian), and if there is such a BOM, type sets an internal fOutputUnicode flag. This flag is checked later to decide whether to call WriteConsoleW.
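The check described above can be sketched in a few lines. This is not cmd.exe's actual code, only an illustration of the same test on the first two bytes of a file; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {
    // True if the data starts with FF FE, the UTF-16LE encoding of U+FEFF.
    static boolean startsWithUtf16LeBom(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xFE;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomCheck <file>");
            return;
        }
        // Reading the whole file is fine for a small test file.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(startsWithUtf16LeBom(data)
                ? "UTF-16LE BOM found"
                : "no UTF-16LE BOM");
    }
}
```

Note that the UTF-32LE BOM (FF FE 00 00) begins with the same two bytes, so a check like this would also accept it; that may be why uc-test-UTF-32LE-bom.txt above comes out readable but with gaps between the letters.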
But that’s the only way to get type to output Unicode, and only for files that have BOMs and are in UTF-16LE. For all other files, and for programs that don’t have special code to handle console output, your files will be interpreted according to the current codepage, and will likely show up as gibberish.
You can emulate how type outputs Unicode to the console in your own programs like so:
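#include <stdio.h>
#include <string.h>   /* strlen */
#define UNICODE
#include <windows.h>

/* This source file must be saved as UTF-8 so that the literal holds UTF-8 bytes. */
static LPCSTR lpcsTest =
    "ASCII abcde xyz\n"
    "German äöü ÄÖÜ ß\n"
    "Polish ąęźżńł\n"
    "Russian абвгдеж эюя\n"
    "CJK 你好\n";

int main() {
    int n;
    DWORD written;
    wchar_t buf[1024];

    HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);

    /* Convert UTF-8 to UTF-16. The last argument is a count of wchar_t, not bytes. */
    n = MultiByteToWideChar(CP_UTF8, 0,
            lpcsTest, (int)strlen(lpcsTest),
            buf, sizeof(buf) / sizeof(buf[0]));

    /* UNICODE is defined, so WriteConsole resolves to WriteConsoleW. */
    WriteConsole(hConsole, buf, (DWORD)n, &written, NULL);

    return 0;
}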
This program works for printing Unicode on the Windows console using the default codepage.
For the sample Java program, we can get a little bit of correct output by setting the codepage manually, though the output gets messed up in weird ways:
Z:\andrew\projects\sx\1259084>chcp 65001
Active code page: 65001
Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
ж эюя
CJK 你好
你好
好
�
= bom
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
еж эюя
CJK 你好
你好
好
�
== UTF-16LE
= no bom
A S C I I a b c d e x y z
…
However, a C program that sets a Unicode UTF-8 codepage:
#include <stdio.h>
#include <windows.h>

int main() {
    size_t n;
    UINT oldCodePage;
    char buf[1024];

    /* Remember the current console codepage so it can be restored afterwards. */
    oldCodePage = GetConsoleOutputCP();
    if (!SetConsoleOutputCP(65001)) {
        printf("error\n");
    }

    freopen("uc-test-UTF-8-nobom.txt", "rb", stdin);
    n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
    fwrite(buf, sizeof(buf[0]), n, stdout);
    /* Make sure the bytes reach the console before the codepage is restored. */
    fflush(stdout);

    SetConsoleOutputCP(oldCodePage);

    return 0;
}
does have correct output:
Z:\andrew\projects\sx\1259084>.\test
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
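The moral of the story?
type can print UTF-16LE files with a BOM regardless of your current codepage.
Win32 programs can output Unicode to the console using WriteConsoleW.
For everything else, you will have to set the codepage with chcp, and will probably still get weird output.
Upvotes: 447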
Reputation: 39
In Java I used encoding "IBM850" to write the file. That solved the problem.
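As a rough sketch of what that could look like (the class name, method name and output file name are made up for illustration), the text is encoded with the IBM850 charset before being written, so the bytes in the file match what a console running code page 850 expects:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Cp850File {
    // Encode text using code page 850, which Java names "IBM850".
    static byte[] toCp850(String text) {
        return text.getBytes(Charset.forName("IBM850"));
    }

    public static void main(String[] args) throws IOException {
        // Escapes keep this source file pure ASCII: a-umlaut, o-umlaut, u-umlaut.
        String text = "German \u00e4\u00f6\u00fc\r\n";
        // "cp850-test.txt" is a hypothetical file name.
        Files.write(Paths.get("cp850-test.txt"), toCp850(text));
    }
}
```

This only helps when the console really is using code page 850, and characters that have no code page 850 equivalent cannot be represented this way.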
Upvotes: 2
Reputation: 1423
I've been frustrated for a long time by Windows code page issues, and by the C program portability and localisation issues they cause. The previous posts have detailed the issues at length, so I'm not going to add anything in this respect.
To make a long story short, eventually I ended up writing my own UTF-8 compatibility library layer over the Visual C++ standard C library. Basically this library ensures that a standard C program works right, in any code page, using UTF-8 internally.
This library, called MsvcLibX, is available as open source at https://github.com/JFLarvoire/SysToolsLib.
More details in the MsvcLibX README on GitHub, including how to build the library and use it in your own programs.
The release section in the above GitHub repository provides several programs using this MsvcLibX library, that will show its capabilities. Ex: Try my which.exe tool with directories with non-ASCII names in the PATH, searching for programs with non-ASCII names, and changing code pages.
Another useful tool there is the conv.exe program. This program can easily convert a data stream from any code page to any other. Its default is input in the Windows code page, and output in the current console code page. This lets you correctly view data generated by Windows GUI apps (e.g. Notepad) in a command console, with a simple command like: type WINFILE.txt | conv
This MsvcLibX library is by no means complete, and contributions for improving it are welcome!
Upvotes: 8
Reputation: 1730
Type chcp to see your current code page (as Dewfy already said).
Use nlsinfo to see all installed code pages and find out what your code page number means.
You need to have the Windows Server 2003 Resource Kit installed (works on Windows XP) to use nlsinfo.
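If nlsinfo is not at hand, a Java runtime can answer part of the same question for the code pages it knows about. The sketch below (class and method names are hypothetical) relies on the assumption that the JRE registers a "cp" plus number alias for the code page in question; it throws an exception for numbers it does not recognise, and it says nothing about which code pages Windows itself has installed.

```java
import java.nio.charset.Charset;

public class CodePageName {
    // Look up Java's canonical charset name for a code page number,
    // using the "cp<number>" alias. Throws if the alias is unknown.
    static String nameOf(int codePage) {
        return Charset.forName("cp" + codePage).name();
    }

    public static void main(String[] args) {
        for (int cp : new int[] {437, 850, 1252}) {
            System.out.println(cp + " -> " + nameOf(cp));
        }
    }
}
```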
Upvotes: 39
Reputation: 23614
The CHCP command shows the current code page. In the console it is typically a three-digit number (8xx, for example 850), which differs from the Windows code pages (12xx, for example 1252). So when typing an English-only text you won't see any difference, but text in an extended code page (such as Cyrillic) will be printed incorrectly.
Upvotes: 6