HockChai Lim

Reputation: 1713

using powershell to replace extended ascii character in a text file

I need to replace a hex 93 character with a "" string inside several CSV files. Below is the code that I'm using, but it is not working. I think the reason it does not work is that the hex value is greater than 0x7F (decimal 127). I've tried several other methods to no avail. Any help would be appreciated.

$q1 = [String](0x93 -as [char])
Get-ChildItem ".\*.csv" -Recurse | ForEach {
  (Get-Content $_ | ForEach { $_.replace($q1, '""') }) |
    Set-Content $_
}

Note: Attached is an image of the Format-Hex dump of my test file. The first character is the one that I need to perform the replace on.

Upvotes: 8

Views: 15310

Answers (1)

mklement0

Reputation: 437568

In Windows PowerShell, the default character encoding when reading from / writing to[1] files is "ANSI", i.e., the legacy 8-bit code page implied by the active system locale.
(By contrast, PowerShell Core defaults to UTF-8.)

For instance, the code page associated with the system locale on a US-English system is 1252, i.e., Windows-1252, where code point 0x93 is the non-ASCII left double quotation mark, “.
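If you want to verify which "ANSI" code page is active on a given machine, a quick way (a sketch, assuming Windows PowerShell, where [Text.Encoding]::Default reflects the "ANSI" encoding) is:

# Windows PowerShell only: [Text.Encoding]::Default is the active "ANSI" encoding.
[Text.Encoding]::Default.CodePage      # e.g., 1252 on a US-English system
[Text.Encoding]::Default.EncodingName  # e.g., "Western European (Windows)"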

However, once a text file's content has been read into memory, a string's characters are represented in memory as UTF-16LE code units, i.e., as .NET [string] instances.

As a Unicode character, “ has code point U+201C, expressed as 0x201c in UTF-16LE.

Therefore - because in memory all strings are UTF-16LE code units - what you need to replace is [char] 0x201c:

$q1 = [char] 0x201c  # “
Get-ChildItem *.csv -Recurse | ForEach-Object {
  (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $_.FullName
}

Note that Set-Content also uses the default character encoding, so the rewritten files will be "ANSI"-encoded as well; use the -Encoding parameter to change the output encoding, if desired.
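For instance, a variant of the loop above that writes the results back as UTF-8 instead (a sketch; UTF8 is just an example value, and in Windows PowerShell it produces files with a BOM):

$q1 = [char] 0x201c  # “
Get-ChildItem *.csv -Recurse | ForEach-Object {
  (Get-Content $_.FullName) -replace $q1, '""' |
    Set-Content $_.FullName -Encoding UTF8  # UTF-8 with BOM in Windows PowerShell
}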

Also note the (...) around the Get-Content call, which ensures that the input file is read into memory in full up front, which in turn enables writing back to the same file in the same pipeline.
While this approach is convenient, note that it bears a slight risk of data loss if writing back to the input file is interrupted before completion.
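If that risk is a concern, one workaround (a sketch, not part of the approach above; the .tmp file name is just an illustration) is to write to a temporary file first and replace the original only once the write has completed:

$q1 = [char] 0x201c  # “
Get-ChildItem *.csv -Recurse | ForEach-Object {
  $tmp = "$($_.FullName).tmp"   # hypothetical temp-file name alongside the original
  (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $tmp
  Move-Item -LiteralPath $tmp -Destination $_.FullName -Force
}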


Converting an "ANSI" code point to a Unicode code point

The following shows how an "ANSI" (8-bit) code point such as 0x93 can be converted to its equivalent UTF-16 code point, 0x201c:

# Convert an array of "ANSI" code points (1 byte each) to the UTF-16
# string they represent. 
# Note: In Windows PowerShell, [Text.Encoding]::Default contains
#       the "ANSI" encoding set by the system locale.
$str = [Text.Encoding]::Default.GetString([byte[]] 0x93) # -> '“'

# Get the UTF-16 code points of the characters making up the string.
$codePoints = [int[]] [char[]] $str

# Format the first and only code point as a hex. number.
'0x{0:x}' -f $codePoints[0]  # -> '0x201c'

[1] Writing files with Set-Content, that is; using Out-File / >, by contrast, creates UTF-16LE ("Unicode") files. The cmdlets in Windows PowerShell display a bewildering array of differing encodings: see this answer. Fortunately, PowerShell Core now consistently defaults to (BOM-less) UTF-8.

Upvotes: 14
