Hannah Banana

Reputation: 53

PowerShell for Matching and Replacing Partially Matching Patterns

I've been going crazy all week, unable to solve this issue. I have a dictionary word file that will eventually hold a few million words; for now, let's assume it's just a text file, "Words.txt", which contains:

App
Apple
Application
Bar
Bat
Batter
Cap
Capital
Candy

What I need is to match each string against the rest of the file and write out only the first hit. The list will be alphabetical.

For example, the desired output for the words above would be:

App - pattern "App" is seen first, so "Apple" and "Application" are skipped
Bar - pattern "Bar" is unique
Bat - pattern "Bat" is seen first, so "Batter" is skipped
Cap - pattern "Cap" is seen first, so "Capital" is skipped
Candy - pattern "Candy" is unique

What I absolutely cannot figure out is how to ignore matches that happen after the initial hit and move on to a 'new' pattern. It would be OK if the other redundant patterns are overwritten or just skipped; it doesn't matter how.

I have a script to match patterns, but I don't know how to end up with the desired output :( Any help?!?!


$Words = "C:\Words.txt"

# Read all words into memory
[System.Collections.ArrayList]$WordList = Get-Content $Words

$i = 0
foreach ($item in $WordList)
{
    $r = 0   # restart the inner index for every outer word
    foreach ($item2 in $WordList)
    {
        # Report every word that starts with the current word
        if ($item2 -like "$item*")
        {
            Write-Host "Match $i $item $r $item2"
        }

        $r++
    }
    $i++
}

Upvotes: 1

Views: 100

Answers (1)

mklement0

Reputation: 437218

It's sufficient to process the lines one by one and compare them to the most recent unique prefix:

$prefix = '' # initialize the prefix pattern
foreach ($line in [IO.File]::ReadLines('C:\Words.txt')) {
  if ($line -like $prefix) { continue } # same prefix, skip
  $line               # output new unique prefix
  $prefix = "$line*"  # save new prefix pattern
}
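
To make the behavior easy to check without a file, here is a minimal sketch of the same loop applied to the sample words from the question. The function name Get-UniquePrefix is made up for illustration; the sketch assumes the input is ordered so that a prefix always comes before the words that extend it.

```powershell
# Hypothetical helper (the name is an assumption, not a built-in cmdlet):
# emits each word that does not start with the most recently emitted word.
function Get-UniquePrefix {
  param([string[]] $Words)
  $prefix = ''                             # no prefix seen yet
  foreach ($word in $Words) {
    if ($word -like $prefix) { continue }  # extends the current prefix, skip
    $word                                  # output the new unique prefix
    $prefix = "$word*"                     # remember it as a wildcard pattern
  }
}

$sample = 'App', 'Apple', 'Application', 'Bar', 'Bat', 'Batter', 'Cap', 'Capital', 'Candy'
$result = Get-UniquePrefix -Words $sample
$result   # App, Bar, Bat, Cap, Candy
```

Because only the most recent prefix is remembered, each word is compared once, which is what keeps this workable for a file with millions of lines.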

Note: Since you mention the input file being large, I'm using System.IO.File.ReadLines rather than Get-Content to read the file, for superior performance.

Note: Your sample input path is a full path anyway, but be sure to always pass full paths to .NET methods, because .NET's working directory usually differs from PowerShell's.
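
If you do start from a relative path, one way to get a full path is Convert-Path, which resolves against PowerShell's current location. A minimal sketch, using a temporary directory and a made-up two-line Words.txt so that it is self-contained:

```powershell
# Set up a throwaway directory with a small sample file (illustrative data only).
$dir = Join-Path ([IO.Path]::GetTempPath()) ([IO.Path]::GetRandomFileName())
$null = New-Item -ItemType Directory -Path $dir
Set-Content -LiteralPath (Join-Path $dir 'Words.txt') -Value 'App', 'Apple'

Push-Location -LiteralPath $dir
try {
  # Convert-Path resolves the relative path against PowerShell's current
  # location and returns a full, native path (the file must already exist).
  $fullPath = Convert-Path -LiteralPath 'Words.txt'
  # The full path is safe to hand to a .NET method.
  $lines = @([IO.File]::ReadLines($fullPath))
} finally {
  Pop-Location
}
Remove-Item -LiteralPath $dir -Recurse   # clean up the throwaway directory
```

Note that Convert-Path only works for paths that exist, so it suits input files; for an output file that doesn't exist yet you would need to build the full path another way.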

If you wrap the foreach loop in & { ... }, you can pipe the result in streaming fashion (line by line, without collecting all results in memory first) to Set-Content.
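
As a rough sketch of that wrapping, with temporary files standing in for your real input and output paths (both variables are placeholders, and the sample words are illustrative):

```powershell
# Placeholder paths: temporary files keep the sketch self-contained.
$inFile  = [IO.Path]::GetTempFileName()
$outFile = [IO.Path]::GetTempFileName()
Set-Content -LiteralPath $inFile -Value 'App', 'Apple', 'Bar', 'Bat', 'Batter'

# The script block emits each unique prefix as soon as it is found,
# so Set-Content receives the lines one by one.
& {
  $prefix = ''
  foreach ($line in [IO.File]::ReadLines($inFile)) {
    if ($line -like $prefix) { continue }  # same prefix, skip
    $line                                  # goes down the pipeline
    $prefix = "$line*"
  }
} | Set-Content -LiteralPath $outFile
```

A plain foreach statement can't be used directly as the start of a pipeline, which is why the & { ... } wrapper is needed here.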

However, using a .NET type for saving as well will perform much better - see the bottom section of this answer.
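
The linked answer isn't reproduced here, but as one possible sketch of the idea, assuming System.IO.StreamWriter is an acceptable choice of .NET type (the paths are temporary placeholders and the sample words are illustrative):

```powershell
# Placeholder paths: temporary files keep the sketch self-contained.
$inFile  = [IO.Path]::GetTempFileName()
$outFile = [IO.Path]::GetTempFileName()
[IO.File]::WriteAllLines($inFile, [string[]] ('App', 'Apple', 'Bar', 'Bat', 'Batter'))

# StreamWriter with just a path overwrites the file and writes UTF-8 without a BOM.
$writer = [IO.StreamWriter]::new($outFile)
try {
  $prefix = ''
  foreach ($line in [IO.File]::ReadLines($inFile)) {
    if ($line -like $prefix) { continue }  # same prefix, skip
    $writer.WriteLine($line)               # write the new unique prefix directly
    $prefix = "$line*"
  }
} finally {
  $writer.Dispose()                        # flush and close the output file
}
```

This bypasses the pipeline entirely, which is where the speedup would come from; the try / finally ensures the file is closed even if reading fails partway through.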

Upvotes: 1
