Reputation: 21
I need to write a script in PowerShell (PSVersion 5.1) to discover files where some of the path names contain é. I am using -LiteralPath in the command, but Get-ChildItem does not find the path: the é is incorrectly interpreted (code page?), so the path is not found. I have tried changing the PS session code page to 65001 (UTF-8), but to no avail. Much of the reading I have done is about encoding outputs with a BOM, etc. I am working on a customer-provided box, so I may be stuck with the PowerShell version, and I do not have elevated permissions. Thank you.
I have tried prefixing the path with \\?\, I am using -LiteralPath, and I have tried setting the code page to 65001.
Code snip: Get-ChildItem -LiteralPath ('\\?\' + $item.location) -File -Recurse
The response is: Cannot find path. And in the reported path, the é is replaced with �.
Note: I'm reading the path in question from a CSV file, using Import-Csv.
Upvotes: 2
Views: 119
Reputation: 438153
Your problem is neither related to the console's active code page (as reported by chcp.com)[1], nor can it be helped by the long-path opt-in, \\?\.
Your symptom implies that Import-Csv is misinterpreting the character encoding of your CSV file.
Specifically, it seems that your CSV file is saved with a fixed-width, single-byte encoding, most likely that of the system locale's ANSI code page (e.g., Windows-1252 on US-English systems), whereas Import-Csv expects UTF-8 in the absence of a BOM.[2]
In Windows-1252, for instance, the é character (LATIN SMALL LETTER E WITH ACUTE, U+00E9) is encoded as a single byte with value 0xE9 (233). When that byte is misinterpreted as UTF-8, it is invalid (any byte with the high bit set, i.e. with a value >= 0x80 (128), must be part of a multi-byte sequence encoding a single non-ASCII character) and therefore gets replaced with � (REPLACEMENT CHARACTER, U+FFFD) to signal that fact. You can verify this as follows:
[Text.Encoding]::UTF8.GetString([byte[]] 0xE9) # -> �
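Conversely, decoding that same byte with the encoding it was actually written in recovers the intended character; a quick sanity check, assuming the file's encoding is Windows-1252:

```powershell
# Byte 0xE9 decoded with the Windows-1252 code page yields é (U+00E9),
# whereas decoding it as UTF-8 yields the replacement character (U+FFFD).
[Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xE9)  # -> é
[Text.Encoding]::UTF8.GetString([byte[]] 0xE9)               # -> �
```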
Solution options:
Either: Pass the appropriate -Encoding argument to Import-Csv:

- Windows PowerShell: use -Encoding Default (which refers to the active ANSI code page).
- PowerShell (Core) 7: use -Encoding ansi (v7.4+), or -Encoding 1252 (for Windows-1252, specifically) in versions where the abstract ansi identifier isn't supported.

Note: If the actual encoding of the CSV file is not that of the active ANSI code page:

- Windows PowerShell: You cannot use -Encoding in this case, because it doesn't accept arbitrary encodings; re-save your file as UTF-8 (see below).
- PowerShell (Core) 7: Use the number of the code page as the -Encoding argument, e.g. -Encoding 1251.
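In Windows PowerShell 5.1, the -Encoding Default approach could look like this; the file path and the location column name are assumptions based on the question's $item.location:

```powershell
# Sketch (Windows PowerShell 5.1), assuming a hypothetical CSV file
# C:\data\paths.csv with a "location" column, saved in the active ANSI code page.
# -Encoding Default makes Import-Csv decode the file using that code page,
# so é arrives intact and the literal path can resolve.
$items = Import-Csv -LiteralPath C:\data\paths.csv -Encoding Default
foreach ($item in $items) {
    Get-ChildItem -LiteralPath $item.location -File -Recurse
}
```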
Or: Re-save your CSV file as UTF-8, which ensures that Import-Csv in both PowerShell editions interprets it correctly.
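The re-save can be done in PowerShell itself; a one-time conversion sketch (Windows PowerShell 5.1, hypothetical file path, assuming the file is currently in the active ANSI code page):

```powershell
# In Windows PowerShell, Get-Content without -Encoding reads ANSI by default,
# and Set-Content -Encoding UTF8 writes UTF-8 with a BOM, which Import-Csv honors.
$text = Get-Content -Raw -LiteralPath C:\data\paths.csv
Set-Content -LiteralPath C:\data\paths.csv -Value $text -Encoding UTF8
```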
[1] The reason is that the console code page only comes into play when PowerShell communicates with external programs. PowerShell itself and its native commands have full Unicode support, independently of the code page.
[2] In the context of Windows PowerShell (powershell.exe, the legacy, ships-with-Windows edition whose latest and last version is 5.1), that Import-Csv defaults to UTF-8 is unusual, not least because its counterpart, Export-Csv, defaults to ASCII(!), and Get-Content and Set-Content, as well as the engine itself (when it reads source code), default to ANSI. See the bottom section of this answer for an overview of the wildly inconsistent default character encodings in Windows PowerShell.
By contrast, PowerShell (Core) 7 commendably consistently defaults to (BOM-less) UTF-8.
Upvotes: 1