Reputation: 23
I wrote a PowerShell script with help from GitHub Copilot. It works with ASCII search terms, but it returns no results when the term contains non-ASCII characters. For example, with the $searchWord variable set to "YANI" the script behaves as expected, but with "KOLİ" it finds no match. How can I make the script handle UTF-8 text when searching Word files?
# Define the directory to search and the word to search for
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$directoryPath = "D:\BAKIM_ARIZA_TAKIP_FORMU\2024\AGUSTOS_AYI"
$searchWord = "KOLİ"

# Load the Word application
$word = New-Object -ComObject Word.Application
$word.Visible = $false

# Get all .docx files in the directory
$docxFiles = Get-ChildItem -Path $directoryPath -Filter *.doc

foreach ($file in $docxFiles) {
    # Open the document
    $document = $word.Documents.Open($file.FullName)

    # Search for the word
    $found = $false
    foreach ($range in $document.StoryRanges) {
        if ($range.Text -match [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::UTF8.GetBytes($searchWord))) {
            $found = $true
            break
        }
    }

    # Output the file name if the word is found
    if ($found) {
        Write-Output "Found '$searchWord' in file: $($file.FullName)"
    }

    # Close the document
    $document.Close()
}

# Quit the Word application
$word.Quit()
Upvotes: 1
Views: 70
Reputation: 11
You have to re-save your PowerShell script as UTF-8 with BOM; otherwise Windows PowerShell will misinterpret any non-ASCII characters in the script (such as İ).
If you need to use non-Ascii characters in your scripts, save them as UTF-8 with BOM. Without the BOM, Windows PowerShell misinterprets your script as being encoded in the legacy "ANSI" codepage. Conversely, files that do have the UTF-8 BOM can be problematic on Unix-like platforms. Many Unix tools such as cat, sed, awk, and some editors such as gedit don't know how to treat the BOM.
Source reference: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding
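If your editor has no "UTF-8 with BOM" save option, the re-save can also be done from PowerShell. This is a minimal sketch, not an official recipe: Add-Utf8Bom is a hypothetical helper name, the script path in the example call is a placeholder, and the sketch assumes the file is currently valid UTF-8 (with or without a BOM).

```powershell
# Hypothetical helper: rewrites a UTF-8 file so that it starts with a BOM.
# Assumes the file is currently valid UTF-8 (with or without a BOM).
function Add-Utf8Bom {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    # Resolve to a full path, because .NET methods do not follow
    # PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # ReadAllText decodes the file as UTF-8 and drops an existing BOM, if any.
    $text = [System.IO.File]::ReadAllText($fullPath, [System.Text.Encoding]::UTF8)

    # UTF8Encoding($true) emits the BOM (EF BB BF) when writing.
    $utf8WithBom = New-Object System.Text.UTF8Encoding $true
    [System.IO.File]::WriteAllText($fullPath, $text, $utf8WithBom)
}

# Example call; the script path is a placeholder.
# Add-Utf8Bom -Path 'C:\Scripts\Search-WordFiles.ps1'
```

Run it once against the saved script, then start the script again so the engine re-reads the file with the BOM in place.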
By the way, there is no need to set [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 explicitly, and no need to encode the string to bytes. You can simply use $range.Text -match $searchWord instead.
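To see why the byte round trip adds nothing, here is a small sketch that needs neither Word nor any particular file encoding. The sample text is made up and stands in for $range.Text, and the search word is built from a code point so the snippet itself stays pure ASCII. The [regex]::Escape call is my own addition, not part of the original script: -match treats its right-hand side as a regular expression, so escaping only matters if the search word ever contains characters such as . or (.

```powershell
# Build "KOLI" with a dotted capital I (U+0130) from its code point,
# so this snippet does not depend on how the file itself is encoded.
$searchWord = 'KOL' + [char]0x0130

# Encoding to UTF-8 bytes and decoding again yields the same string.
$utf8 = [System.Text.Encoding]::UTF8
$roundTripped = $utf8.GetString($utf8.GetBytes($searchWord))
$sameString = [string]::Equals($roundTripped, $searchWord, [System.StringComparison]::Ordinal)

# Made-up stand-in for $range.Text from a Word document.
$sampleText = 'BAKIM KOL' + [char]0x0130 + ' FORMU'

# Direct match; Escape makes the search word a literal pattern.
$isMatch = $sampleText -match [regex]::Escape($searchWord)

Write-Output "Round trip unchanged: $sameString"
Write-Output "Match found: $isMatch"
```

Since the round trip returns the same string, the match only fails when $searchWord was already misread at the time the script file was parsed, which is the encoding problem described above.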
Upvotes: 1