Mustafa Sarıalp

Reputation: 23

How to search for a UTF-8 string in Word files using PowerShell

I created a PowerShell script with assistance from GitHub Copilot. It works well with ASCII characters, but when I search for a word containing non-ASCII characters, it doesn't return any results. For example, when I set the $searchWord variable to "YANI", the script performs as expected; however, when I change it to "KOLİ", it fails to find a match. How can I ensure that the script searches using UTF-8 encoding when working with Word files?

# Use UTF-8 for console output
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8

# Define the directory to search and the word to search for
$directoryPath = "D:\BAKIM_ARIZA_TAKIP_FORMU\2024\AGUSTOS_AYI"
$searchWord = "KOLİ"

# Load the Word application
$word = New-Object -ComObject Word.Application
$word.Visible = $false

# Get all Word documents in the directory
$docxFiles = Get-ChildItem -Path $directoryPath -Filter *.doc

foreach ($file in $docxFiles) {
    # Open the document
    $document = $word.Documents.Open($file.FullName)
    
    # Search for the word
    $found = $false
    foreach ($range in $document.StoryRanges) {
        if ($range.Text -match [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::UTF8.GetBytes($searchWord))) {
            $found = $true
            break
        }
    }
    
    # Output the file name if the word is found
    if ($found) {
        Write-Output "Found '$searchWord' in file: $($file.FullName)"
    }
    
    # Close the document
    $document.Close()
}

# Quit the Word application
$word.Quit()

Upvotes: 1

Views: 70

Answers (1)

burnie

Reputation: 11

You have to re-save your PowerShell script as UTF-8 with BOM; otherwise Windows PowerShell will misinterpret any non-ASCII characters (such as İ) in the script.
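
If you would rather do the re-save from PowerShell than from your editor, here is a minimal sketch. The function name Save-AsUtf8Bom is hypothetical, and the sketch assumes the file is currently valid UTF-8 (with or without a BOM). If your editor saved it in some other encoding, re-save it from the editor instead.

```powershell
# Hypothetical helper: rewrite a text file as UTF-8 with a BOM.
# Assumes the file is currently valid UTF-8 (with or without a BOM).
function Save-AsUtf8Bom {
    param([Parameter(Mandatory)][string]$Path)

    # Resolve to a full path, because .NET methods ignore PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read the text as UTF-8; a leading BOM, if present, is consumed on read.
    $text = [System.IO.File]::ReadAllText($fullPath, [System.Text.Encoding]::UTF8)

    # UTF8Encoding($true) emits the BOM (bytes EF BB BF) at the start of the file.
    $utf8WithBom = New-Object System.Text.UTF8Encoding($true)
    [System.IO.File]::WriteAllText($fullPath, $text, $utf8WithBom)
}
```

You would call it as Save-AsUtf8Bom -Path .\search.ps1, where search.ps1 is a placeholder for your script's file name.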

If you need to use non-Ascii characters in your scripts, save them as UTF-8 with BOM. Without the BOM, Windows PowerShell misinterprets your script as being encoded in the legacy "ANSI" codepage. Conversely, files that do have the UTF-8 BOM can be problematic on Unix-like platforms. Many Unix tools such as cat, sed, awk, and some editors such as gedit don't know how to treat the BOM.

Source reference: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding
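
To see why the match fails, you can reproduce the misreading by hand. This is only an illustration: it assumes code page 1254 (Turkish) as a stand-in for your system's "ANSI" code page, and it builds the word from a code point so the snippet itself contains no non-ASCII characters.

```powershell
# Build "KOLİ" from a code point so this snippet stays pure ASCII and
# behaves the same no matter how its own file is encoded.
$searchWord = "KOL" + [char]0x0130

# The bytes an editor writes when saving the script as UTF-8: İ becomes C4 B0.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($searchWord)

# Assumption: code page 1254 stands in for the system's "ANSI" code page.
$ansi = [System.Text.Encoding]::GetEncoding(1254)
$misread = $ansi.GetString($utf8Bytes)

# Each of the two bytes of İ is decoded as a separate character,
# so the misread string is longer and no longer equals the intended word.
$misread.Length              # 5
$misread -ceq $searchWord    # False
```

The misread string is what the script ends up searching for when the file has no BOM, which is why a word that is clearly present in the document is never found.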

By the way, there is no need to set [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 explicitly, and no need to convert the string to bytes and back; that round trip simply returns the same string. You can use $range.Text -match $searchWord instead.
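
One caveat: -match treats its right-hand side as a regular expression, so a search word containing characters such as . or ( would not be matched literally. A minimal sketch of a literal comparison, using a hypothetical helper name (Test-ContainsText):

```powershell
# Hypothetical helper: literal (non-regex), case-insensitive substring test.
function Test-ContainsText {
    param(
        [Parameter(Mandatory)][AllowEmptyString()][string]$Text,
        [Parameter(Mandatory)][string]$Word
    )

    # [regex]::Escape neutralises regex metacharacters in the search word,
    # so -match compares it as plain text.
    return $Text -match [regex]::Escape($Word)
}
```

Inside the loop from the question, the condition would then read: if (Test-ContainsText -Text $range.Text -Word $searchWord) { ... }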

Upvotes: 1
