Reputation: 21
I need to write a script in PowerShell (PSVersion 5.1) to discover files where some of the path names contain é. I am using -LiteralPath in the command, but Get-ChildItem does not find the path: the é is incorrectly interpreted (code page?), so the path is not found. I have tried changing the PS session code page to 65001 (UTF-8), but to no avail. Much of the reading I have done is about encoding outputs with a BOM, etc. I am working on a customer-provided box, so I may be stuck with the PowerShell version, and I do not have elevated permissions. Thank you.
I have tried prefixing the path with \\?\, I am using -LiteralPath, and I have tried setting the code page to 65001.
Code snip: Get-ChildItem -LiteralPath ('\\?\' + $item.location) -File -Recurse
The response is: Cannot find path. And in the reported path, the é is replaced with �.
Note: I'm reading the path in question from a CSV file, using Import-Csv.
Upvotes: 2
Views: 119
Reputation: 438153
Your problem is neither related to the console's active code page (as reported by chcp.com)[1], nor can it be helped by the long-path opt-in, \\?\.
Your symptom implies that Import-Csv is misinterpreting the character encoding of your CSV file.
Specifically, it seems that your CSV file is saved with a fixed-width, single-byte encoding, most likely that of the system locale's ANSI code page (e.g., Windows-1252 on US-English systems), whereas Import-Csv expects UTF-8 in the absence of a BOM.[2]
In Windows-1252, for instance, the é character (LATIN SMALL LETTER E WITH ACUTE, U+00E9) is encoded as a single byte with value 0xE9 (233). When that byte is misinterpreted as UTF-8, it is invalid (any byte with the high bit set, i.e. with a value >= 0x80 (128), must be part of a multi-byte sequence encoding a single non-ASCII character) and therefore gets replaced with � (REPLACEMENT CHARACTER, U+FFFD) to signal that fact. You can verify this as follows:
[Text.Encoding]::UTF8.GetString([byte[]] 0xE9) # -> �
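Conversely, decoding that same byte with the encoding it was actually written in recovers the intended character; a quick sanity check, assuming the file's encoding is Windows-1252:

```powershell
# Byte 0xE9 decoded with the Windows-1252 code page yields é (U+00E9),
# whereas decoding it as UTF-8 yields the replacement character (U+FFFD).
[Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xE9)  # -> é
[Text.Encoding]::UTF8.GetString([byte[]] 0xE9)               # -> �
```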
Solution options:
Either: Pass the appropriate -Encoding argument to Import-Csv:

- Windows PowerShell: use -Encoding Default (which refers to the active ANSI code page).
- PowerShell (Core) 7: use -Encoding ansi (v7.4+), or -Encoding 1252 (for Windows-1252, specifically) in versions where the abstract ansi identifier isn't supported.

Note: If the actual encoding of the CSV file is not that of the active ANSI code page:

- Windows PowerShell: You cannot use -Encoding in this case, because it doesn't accept arbitrary encodings; re-save your file as UTF-8 (see below).
- PowerShell (Core) 7: Use the number of the code page as the -Encoding argument, e.g. -Encoding 1251.
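In Windows PowerShell 5.1, the -Encoding Default approach could look like this; the file path and the location column name are assumptions based on the question's $item.location:

```powershell
# Sketch (Windows PowerShell 5.1), assuming a hypothetical CSV file
# C:\data\paths.csv with a "location" column, saved in the active ANSI code page.
# -Encoding Default makes Import-Csv decode the file using that code page,
# so é arrives intact and the literal path can resolve.
$items = Import-Csv -LiteralPath C:\data\paths.csv -Encoding Default
foreach ($item in $items) {
    Get-ChildItem -LiteralPath $item.location -File -Recurse
}
```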
Or: Re-save your CSV file as UTF-8, which ensures that Import-Csv in both PowerShell editions interprets it correctly.
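The re-save can be done in PowerShell itself; a one-time conversion sketch (Windows PowerShell 5.1, hypothetical file path, assuming the file is currently in the active ANSI code page):

```powershell
# In Windows PowerShell, Get-Content without -Encoding reads ANSI by default,
# and Set-Content -Encoding UTF8 writes UTF-8 with a BOM, which Import-Csv honors.
$text = Get-Content -Raw -LiteralPath C:\data\paths.csv
Set-Content -LiteralPath C:\data\paths.csv -Value $text -Encoding UTF8
```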
[1] The reason is that the console code page only comes into play when PowerShell communicates with external programs. PowerShell itself and its native commands have full Unicode support, independently of the code page.
[2] In the context of Windows PowerShell (powershell.exe, the legacy, ships-with-Windows edition whose latest and last version is 5.1), that Import-Csv defaults to UTF-8 is unusual, not least because its counterpart, Export-Csv, defaults to ASCII(!), and Get-Content and Set-Content, as well as the engine itself (when it reads source code), default to ANSI. See the bottom section of this answer for an overview of the wildly inconsistent default character encodings in Windows PowerShell.
By contrast, PowerShell (Core) 7 commendably consistently defaults to (BOM-less) UTF-8.
Upvotes: 1